Categories
Agility

XML, Latin, and the demise or endurance of languages

We are living in a period of great fluctuation and uncertainty.  In nearly every domain — whether politics, business, technology, or health policy — people are asking what foundation the future will be built upon.  Even the very currency of language doesn’t seem solid.  We don’t know if everyone agrees what concepts mean anymore or what’s considered the source of truth.

Language provides a set of rules and terms that allow us to exchange information.  We can debate whether the rules and terms are good ones — whether they support expression well.  But even more important is whether other groups understand how to use these rules and terms.  Ubiquity is more important than expressiveness because a rich language is not very useful if few people can understand it.

I used to live in Rome, the Eternal City.  When I walked around, I encountered Latin everywhere: it is carved on ancient ruins and Renaissance churches.  No one speaks Latin today, of course.  Latin is a dead language.  Yet there’s also no escaping its legacy.  Latin was ubiquitous and is still found scattered around in many places, even though hardly anyone understands it today.  Widely used languages such as Latin may die off over time, but they don’t suddenly disappear.  Slogans in Latin still appear on our public buildings and currency.

I want to speculate about the future of the XML markup language and the extent to which it will be eternal.  It’s a topic that elicits diverging opinions, depending on where one sits.  XML is the foundation of several standards advocated by certain content professionals.  And XML is undergoing a transition: it’s lost popularity but is still present in many areas of content. What will be the future role of XML for everyday online content?  

In the past, discussions about XML could spark heated debates between its supporters and detractors.  A dozen years ago, for example, the web world debated the XHTML-2 proposal to make HTML compliant with XML.  Because of its past divisiveness, discussions comparing XML to alternatives can still trigger defensiveness and wariness among some even now.  But apart from a small number of partisans who use XML either willingly or unwillingly, the role of XML today is not a major concern for most people.  Past debates about whether XML-based approaches are superior or inferior to alternatives are largely academic at this point.  For the majority of people who work with web content, XML seems exotic: like a parallel universe that uses an unfamiliar language.

Though only a minority of content professionals focus on XML now, everyone who deals with content structure should understand where XML is heading.  XML continues to shape much of the background of content, including ways of thinking about content that are both good and bad.  It exerts a silent influence over how we think about content, even for those who don’t actively use it.  The differences between XML and its alternatives are rarely discussed directly now, having been driven under the surface, out of view — a tacit truce to “agree to disagree” and ignore alternatives.  That’s unfortunate, because it results in silos of viewpoints about content that are mutually contradictory.  I don’t believe choices about the structural languages that define communications should be matters of personal preference, because these choices have consequences that affect all kinds of stakeholders in the near and long term.  Language, ultimately, is about being able to construct a common meaning between different parties — something content folks should care about deeply, whatever their starting views.

XML today

Like Latin, XML has experienced growth and decline.  

XML started out promising to provide a universal language for the exchange of content.  In its early days it succeeded in becoming the standard for defining many kinds of content, some of which are still widely used.  A notable example is the Android platform, first released in 2008, which uses XML for screen layouts.  But XML never succeeded in conquering the world by defining all content.  Despite impressive early momentum, for the past decade XML has seemed less important with each passing year.  Android’s screen layout was arguably the last major XML-defined initiative.

A small example of XML’s decline is the fate of RSS feeds.  RSS was one of the first XML formats for content and was instrumental in the expansion of the first wave of blogging.  However, over time, fewer and fewer blogs and websites actively promoted RSS feeds.  RSS is still widely used but has been eclipsed by other ways of distributing content.  Personally, I’m sorry to see RSS’s decline.  But I am powerless to change that.  Individuals must adapt to collectively-driven decisions surrounding language use.

By 2010, XML could no longer credibly claim to be the future of content.  Web developers were rejecting XML on multiple fronts:

  • Interactive websites, using an approach then referred to as AJAX (the X standing for XML), stopped relying on XML and started using the more web-friendly data format known as JSON, designed to work with JavaScript, the most popular web programming language (see the comparison sketch after this list).
  • The newly-released HTML5 standard rejected XML compatibility.  
  • The RESTful API standard for content exchange started to take off, which embraced JSON over XML.  
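
The contrast that won developers over is easy to see side by side.  Below is a minimal, hypothetical sketch of the same content item expressed first in XML and then in JSON (the element and field names are invented for illustration):

    <article>
      <title>XML and Latin</title>
      <author>Michael Andrews</author>
      <wordCount>2500</wordCount>
    </article>

    {
      "title": "XML and Latin",
      "author": "Michael Andrews",
      "wordCount": 2500
    }

The JSON version maps directly onto the objects JavaScript already works with, while the XML version must first be parsed and traversed as a document tree — one reason web developers found JSON friendlier.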

Around the same time, web content creators were getting more vocal about “the authoring experience” — criticizing technically cumbersome UIs and demanding more writer-friendly authoring environments.  Many web writers, who generally weren’t technical writers or developers, found XML’s approach difficult to understand and use.  They preferred simpler options such as WordPress and Markdown.  This shift was part of a wider trend where employees expect their enterprise applications to be as easy to use as their consumer apps. 

The momentum pushing XML into a steady decline had started.  It retreated from being a mainstream approach to becoming one used to support specialized tasks.  Its supporters maintained that while it may not be the only solution, it was still the superior one.  They hoped that eventually the rest of the world would recognize the unique value of what XML offered and adopt it, scrambling to play catch-up.

That faith in XML’s superiority continues among some.  At the Lavacon content strategy conference this year, I continued to hear speakers, some of whom may have worked with XML for their entire careers, refer to XML as the basis of “intelligent content.”  Among people who work with XML, a common refrain is that XML makes content future-ready.  These characterizations imply that if you want to be smarter with content and make it machine-ready, it needs to be in XML.

The myth that XML is the foundation of the future has been around since its earliest days.  Take the now-obscure AI markup language, AIML, created in 2001, which was an attempt to encode “AI” in XML.  It ended up being one of many zombie XML standards that weren’t robust enough for modern implementations and thus weren’t widely used.  Given trends in XML usage, it seems likely that other less mainstream XML-centric standards and approaches will face a similar fate.  XML is not intrinsically superior to other approaches.  It is simply different, having both strengths and weaknesses.  Teleological explanations — implying a grand historical purpose — tend to stress the synergies between various XML standards and tools that provide complementary building blocks supporting the future.  Yet they can fail to consider the many factors that influence the adoption of specific languages.

The AIML example highlights an important truth about formal IT languages: simply declaring them as a standard and as open-source does not mean the world is interested in using them.  XML-based languages are often promoted as standards, but their adoption is often quite limited.  De facto standards — ones that evolve through wide adoption rather than committee decisions — are often more important than “official” standards.  

What some content professionals who advocate XML seem to under-appreciate is how radically developments in web technologies have transformed the foundations of content.  XML became the language of choice for an earlier era in IT when big enterprise systems built in Java dominated.  XML became embedded in these systems and seemed to be at the center of everything.  But the era of big systems was different from today’s.  Big systems didn’t need to talk to each other often: they tried to manage everything themselves.  

The rise of the cloud (specifically, RESTful APIs) disrupted the era of big systems and precipitated their decline.  No longer were a few systems trying to manage everything.  Lots of systems were handling many activities in a decentralized manner.  Content needed to be able to talk easily to other systems.  It needed to be broken down into small nuggets that could be quickly exchanged via an API.  XML wasn’t designed to be cloud-friendly, and it has struggled to adapt to the new paradigm.  RESTful APIs depend on easy, reliable, and fast data exchanges — something XML can’t offer.
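
As a rough sketch of what that paradigm looks like (the endpoint and field names here are hypothetical), a client asks a RESTful API for one small nugget of content and receives a compact JSON fragment in return:

    GET /api/articles/42
    Accept: application/json

    {
      "id": 42,
      "title": "XML and Latin",
      "summary": "Widely used languages decline slowly.",
      "updated": "2020-11-30T10:00:00Z"
    }

No shared document schema needs to be negotiated in advance: the client takes the fields it wants and ignores the rest.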

A few Lavacon speakers candidly acknowledged the feeling that the XML content world is getting left behind.  The broader organizations in which they are employed — marketing, developers, and writers — aren’t buying into the vision of an XML-centric universe.

And the facts bear out the increasing marginalization of XML.  According to a study last year by Akamai, 83% of web traffic today is APIs and only 17% is browsers.  This reflects the rise of smartphones and other new devices and channels.  Of APIs, 69% use the JSON format, with HTML a distant second. “JSON traffic currently accounts for four times as much traffic as HTML.” And what about XML?   “XML traffic from applications has almost disappeared since 2014.”  XML is becoming invisible as a language to describe content on the internet.

Even those who love working with XML must have asked themselves: What happened?  Twenty years ago, XML was heralded as the future of the web.  To point out the limitations of XML today does not imply XML is not valuable.  At the same time, it is productive to reality-check triumphalist narratives of XML, which linger long after its eclipse.  Memes can have a long shelf life, detached from current realities.  

XML has not fallen out of favor because of any marketing failure or political power play.  Broader forces are at work. One way we can understand why XML has failed, and how it may survive, is by looking at the history of Latin.

Latin’s journey from universal language to a specialized vocabulary

Latin was once one of the world’s most widely-used languages.  At its height, it was spoken by people from northern Africa and western Asia to northern Europe.

The growth and decline of Latin provides insights into how languages, including IT-flavored ones such as XML, succeed and fail.  The success of a language depends on expressiveness and ubiquity.

Latin is a natural language that evolved over time, in contrast to XML, which is a formal language intentionally created to be unambiguous.  Both express ideas, but a natural language is more adaptive to changing needs.  Latin has a long history, transforming in numerous ways over the centuries.

In Latin’s early days during the Roman Republic, it was a widely-spoken vernacular language, but it wasn’t especially expressive.  If you wanted to write or talk about scientific concepts, you still needed to use Greek.  Eventually, Latin developed the words necessary to talk about scientific concepts, and the use of Greek by Romans diminished.  

The collapse of the Roman Empire corresponded to Latin’s decline as a widely-spoken vernacular language.  Latin was never truly monolithic, but without an empire imposing its use, the language fragmented into many different variations, or else was jettisoned altogether.  

In the Middle Ages, the Church had a monopoly on learning, ensuring that Latin continued to be important, even though it was not any person’s “native” language.  Latin had become a specialized language used for clerical and liturgical purposes.  The language itself changed, becoming more “scholastic” and more narrow in expression. 

By the Renaissance, Latin had morphed into a written language that wasn’t generally spoken.  Although Latin’s overall influence on Europeans was still diminishing, it experienced a modest revival because legacy writings in Latin were being rediscovered.  It was important to understand Latin to uncover knowledge from the past — at least until that knowledge was translated into vernacular languages.  Latin was decidedly “unvernacular”: a rigid language of exchange.  Erasmus wrote in Latin because he wanted to reach readers in other countries, and using Latin was the best means to do that, even if the audience was small.  A letter written in Latin could be read by an educated person in Spain or Holland, even if those people would normally speak Spanish or Dutch.  Yet Galileo wrote in Italian, not Latin, because his patrons didn’t understand Latin.  Latin was an elite language, and over time the size of the elite who knew Latin shrank.

Latin ultimately died because it could not adapt to changes in the concepts that people needed to express, especially concerning new discoveries, ideas, and innovations.

Latin has transitioned from being a complete language to becoming a controlled vocabulary.  Latin terms may be understood by doctors, lawyers, or botanists, but even these groups are being urged to use plain English to communicate with the public.  Only in communications among themselves do they use Latin terms, which can be less ambiguous than colloquial ones. 

Latin left an enduring legacy we rarely think about. It gave us the alphabet we use, allowing us to write text in most European languages as well as many other non-European ones.  

XML’s future

Much as the collapse of the Roman Empire triggered the slow decline of Latin, the disruption of big IT systems by APIs has triggered the long-term decline of XML.  But XML won’t disappear suddenly, and it may even change shape as it tries to find niche roles in a cloud-dominated world.

Robert Glushko’s book, The Discipline of Organizing, states: “‘The XML World’ would be another appropriate name for the document-processing world.”  XML is tightly fused to the concept of documents — which are increasingly irrelevant artifacts on the internet.  

The internet has been gradually and steadily killing off the ill-conceived concept of “online documents.”  People increasingly encounter and absorb screens that are dynamically assembled from data.  The content we read and interact with is composed of data. Very often there’s no tangible written document that provides the foundation for what people see.  People are seeing ghosts of documents: they are phantom objects on the web. Since few online readers understand how web screens are assembled, they project ideas about what they are seeing.  They tell themselves they are seeing “pages.” Or they reify online content as PDFs.  But these concepts are increasingly irrelevant to how people actually use digital content.  Like many physical things that have become virtual, the “online document” doesn’t really resemble the paper one.  Online documents are an unrecognized case of skeuomorphism.

None of this is to say that traditional documents are dead.  XML will maintain an important role in creating documents.  What’s significant is that documents are returning to their roots: the medium of print (or equivalent offline digital formats).  XML grew out of efforts to solve document publishing problems.  Microsoft’s Word and PowerPoint formats are XML-based, and Adobe’s PDF format incorporates XML components such as XMP metadata.  Both firms are trying to make these “office” products escape the gravity-weight of the document and become more data-like.  But documents have never fit comfortably in an interactive, online world.  People often confuse the concepts of “digital” and “online”.  Everything online is digital, but not everything digital is online or meant to be.  A Word document is not fun to read online.  Most documents aren’t.  Think about the 20-page terms and conditions document you are asked to agree to.

A document is a special kind of content.  It’s a highly ordered, large content item.  Documents are linear, with a defined start and finish.  A book, for example, starts with a title page, provides a table of contents, and ends with an appendix and index.  Documents are offline artifacts.  They are records that are meant to be enduring and not change.  Most online content, however, is impermanent and needs to change frequently.  As content online has become increasingly dynamic, the need for maintaining consistent order has lessened as well.  Online content is accessed non-linearly.

XML promoted a false hope that the same content could be presented equally well both online and offline — specifically, in print.  But publishers have concluded that print and online are fundamentally different.  They can’t be equal priorities.  Either one or the other will end up driving the whole process.  For example, The Wall Street Journal, which has an older subscriber base, has given enormous attention to its print edition, even as other newspapers have de-emphasized or even dropped theirs.  In a review of its operations this past summer, the Journal found that its editorial processes were dominated by print, because print’s needs are different.  Decisions about content are based on the layout needs of print, such as content length, article and image placement, as well as the differences in delivering a whole edition versus delivering a single article.  Print has been hindering the Journal’s online presence because it’s not possible to deliver the same content to print and screen as equally important experiences.  As a result, the Journal is contemplating de-emphasizing print, making it follow online decisions rather than compete with them.

Some publishers have no choice but to create printable content.  XML will still enjoy a role in industrial-scale desktop publishing.  Pharmaceutical companies, for example, need to print labels and leaflets explaining their drugs.  The customer’s physical point of access to the product is critical to how it is used — potentially more important than any online information.  In these cases, the print content may be more important than the online content, driving the process for how online channels deliver the content.  Not many industries are in this situation and those that are can be at risk of becoming isolated from the mainstream of web developments.  

XML still has a role to play in the management of certain kinds of digital content.  Because XML is older and has a deeper legacy, it has been decidedly more expressive until recently.  Expressiveness relates to the ability to define concepts unambiguously.  People used to fault the JSON format for lacking a schema like XML has, though JSON now offers such a schema.  XML is still more robust in its ability to specify highly complex data structures, though in many cases alternatives exist that are compatible with JSON.   Document-centric sectors such as finance and pharmaceuticals, which have burdensome regulatory reporting requirements, remain heavy users of XML.  Big banks and other financial institutions, which are better known for their hesitancy than their agility, still use XML to exchange financial data with regulators. But the fast-growing FinTech sector is API-centric and is not XML-focused.  The key difference is the audience.  Big regulated firms are focused on the needs of a tightly knit group of stakeholders (suppliers, regulators, etc.) and prioritize the bulk exchange of data with these stakeholders.  Firms in more competitive industries, especially startups, are focused on delivering content to diverse customers, not bulk uploads.  
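
As a minimal sketch of the schema capability JSON now offers (the property names here are illustrative), a JSON Schema declares types, constraints, and required fields for a piece of content:

    {
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "type": "object",
      "properties": {
        "title":     { "type": "string" },
        "wordCount": { "type": "integer", "minimum": 0 }
      },
      "required": ["title"]
    }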

XML and content agility

The downside of expressiveness is heaviness.  XML has been criticized as verbose and heavy — much like Victorian literature.  Just as Dickensian prose has fallen out of favor with contemporary audiences, verbose markup is becoming less popular.  Anytime people can choose between a simple way or a complex one to do the same thing, they choose the simple one.  Simple, plain, direct. They don’t want elaborate expressiveness all the time, only when they need it.  

When people talk about content as being intelligent (understandable to other machines), they may mean different things.  Does the machine need to be able to understand everything about all the content from another source, or does it only need to have a short conversation with the content?  XML is based on the idea that different machines share a common schema or basis of understanding.  It has a rigid formal grammar that must be adhered to.  APIs are less worried about each machine understanding everything about the content coming from everywhere else.  They only care about understanding (accessing and using) the content they are interested in (a query).  That allows for more informal communication.  By being less insistent on speaking an identical formal language, APIs enable content to be exchanged more easily and used more widely.  As a result, content defined by APIs is more ubiquitous: able to move quickly to where it’s needed.

Ultimately, XML and APIs embrace different philosophies about content.  XML provides a monolithic description of a huge block of content.  It’s concerned with strictly controlling a mass of content and involves a tightly coupled chain of dependencies, all of which must be satisfied for the process to work smoothly.  APIs, in contrast, are about connecting fragments of content.  It’s a decentralized, loosely coupled, bottom-up approach.  (The management of content delivered by APIs is handled by headless content models, but that’s another topic.)

Broadly speaking, APIs treat the parts as more important than the whole.  XML treats the whole as more important than the parts.  

Our growing reliance on the cloud has made it increasingly important to connect content quickly.  That imperative has made content more open.  And openness depends on outsiders being able to understand what the content is and use it quickly.  

As XML has declined in popularity, one of its core ideas has been challenged.  The presumption has been that the more markup in the content, the better.  XML allows for many layerings of markup, which can specify what different parts of text concern.  The belief was that this was good: it made the text “smarter” and easier for machines to parse and understand.  In practice, this vision hasn’t materialized.  XML-defined text could be subject to so many parenthetical qualifications that it was like trying to parse some arcane legalese.  Only the author understood what was meant and how to interpret it.  The “smarter” the XML document tried to be, the more illegible it became to the people who had to work with it — other authors or developers who would do something later with the content.  Compared with the straightforward language of key-value pairs and declarative API requests, XML documentation became an advertisement for how difficult its markup is to use.  “The limitations in JSON actually end up being one of its biggest benefits. A common line of thought among developers is that XML comes out on top because it supports modeling more objects. However, JSON’s limitations simplify the code, add predictability and increase readability.”  Too much expressiveness becomes an encumbrance.
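
To see the problem concretely, compare a sentence annotated with several layers of inline XML qualification (a hypothetical sketch — the element names are invented) against the flat key-value style of an API payload:

    <p>The <term ref="t-101" audience="expert" revision="3">latency</term>
    budget is <value unit="ms" source="spec-4.2" status="draft">20</value>.</p>

    { "term": "latency", "budgetMs": 20, "status": "draft" }

Only the original author can say with confidence what all those qualifications mean; the key-value version says less, but anyone can read it.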

Like any monolithic approach, XML has become burdened by details as it has sought to address all contingencies.  As XML ages, it suffers from technical debt.  The specifications have grown, but don’t necessarily offer more.  XML’s situation today is similar to Latin’s situation in the 18th century, when scientists were trying to use it to communicate new scientific concepts.  One commenter asserts that XML suffers from worsening usability: “XML is no longer simple. It now consists of a growing collection of complex connected and disconnected specifications. As a result, usability has suffered. This is because it takes longer to develop XML tools. These users are now rooting for something simpler.”  Simpler things are faster, and speed matters mightily in the connected cloud.  What’s relevant depends on providing small details right when they are needed.

At a high level, digital content is bifurcating between API-first approaches and those that don’t rely on APIs.  An API-first approach is the right choice when content is fast-moving.  And nearly all forms of content need to speed up and become more agile.  Content operations are struggling to keep up with diversifying channels and audience segmentation, as well as the challenges of keeping the growing volumes of online content up-to-date.  While APIs aren’t new anymore, their role in leading how content is organized and delivered is still in its early stages.  Very few online publishers are truly API-first in their orientation, though the momentum of this approach is building.

When content isn’t fast-moving, APIs are less important. XML is sometimes the better choice for slow-moving content, especially if the entire corpus is tightly constructed as a single complex entity.  Examples are legal and legislative documents or standards specifications. XML will still be important in defining the slow-moving foundations of certain core web standards or ontologies like OWL — areas that most web publishers will never need to touch.  XML is best suited for content that’s meant to be an unchanging record.  

Within web content, XML won’t be used as a universal language defining all content, since most online content changes often.  For those of us who don’t have to use XML as our main approach, how is it relevant?  I expect XML will play niche roles on the web.  XML will need to adapt to the fast-paced world of APIs, even if reluctantly.  To function more agilely, it will be used in a selective way to define fragments of content.

An example of fragmental XML is how Google uses SSML, an XML-based markup standard for indicating speech emphasis and pronunciation.  This standard predates the emergence of consumer voice interfaces, such as “Hey Google!”  Because it was already in place, Google has incorporated it within the JSON-defined schema.org semantic metadata it uses.  The XML markup, with its angled brackets, is inserted within the quote marks and curly brackets of JSON.  JSON describes the content overall, while XML provides assistance to indicate how to say words aloud.
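
A hedged sketch of what that nesting looks like (the JSON wrapper here is simplified and hypothetical; real payloads vary by platform, but the SSML elements are standard):

    {
      "response": {
        "ssml": "<speak>Say <say-as interpret-as=\"characters\">XML</say-as> with <emphasis level=\"strong\">feeling</emphasis>.</speak>"
      }
    }

The angle brackets live inside a quoted JSON string; the consuming voice platform parses the XML only when it needs to speak.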

SVG, used to define vector graphics, is another example of fragmental XML.  SVG image files are embedded in or linked to HTML files without needing to have the rest of the content be in XML.
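
For example, a small SVG island can sit directly inside an otherwise ordinary HTML page (an illustrative snippet):

    <p>Annual growth:</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="120" height="60">
      <rect x="10" y="30" width="20" height="20" fill="steelblue"/>
      <rect x="40" y="10" width="20" height="40" fill="steelblue"/>
    </svg>

The XML stays quarantined inside the <svg> element; the rest of the page never needs to be XML-valid.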

More generally, XML will exist on the web as self-contained files or as snippets of code.  We’ll see less use of XML to define the corpus of text as a whole.  The stylistic paradigm of XML, of using in-line markup — comments within a sentence — is losing its appeal, as it is hard for both humans and machines to read and parse.  An irony is that while XML has made its reputation for managing text, it is not especially good at managing individual words.  Swapping words out within a sentence is not something that any traditional programming approach does elegantly, whether XML-based or not, because natural language is more complex than an IT language processor.  What’s been a unique advantage of XML — defining the function of words within a sentence — is starting to be less important.  Deep learning techniques (e.g., GPT-3) can parse wording at an even more granular level than XML markup, without the overhead.  Natural language generation can construct natural-sounding text.  Over time, the value of in-line markup for speech, such as that used in SSML, will diminish as natural language generation improves its ability to present prosody in speech.  While deep learning can manage micro-level aspects of words and sentences, it is far from being able to manage structural and relational dimensions of content.  Different approaches to content management, whether utilizing XML or APIs connected to headless content models, will still be important.

As happened with Latin, XML is evolving away from being a universal language.  It is becoming a controlled vocabulary used to define highly specialized content objects.  And much like Latin gave us the alphabet upon which many languages are built, XML has contributed many concepts to content management that other languages will draw upon for years to come.  XML may be becoming more of a niche, but it’s a niche with an outsized influence.

— Michael Andrews

Categories
Big Content

Who benefits from schema.org?

Schema.org shapes the behavior of thousands of companies and millions of web users.  Even though few users of the internet have heard of it, the metadata standard exerts a huge influence over how people get information, and from whom they get it.  Yet the question of who precisely benefits from the standard gets little attention.  Why does the standard exist, and does everyone materially benefit equally from it?  While a metadata standard may sound like a technical mechanism generating impartial outcomes, metadata is not always fair in its implementation.

Google has a strong vested interest in the fate of schema.org — but it is not the only party affected.  Other parties need to feel there are incentives to support schema.org.  Should they feel they experience disincentives, that sentiment could erode support for the standard.

As schema.org has grown in influence over the past decade, that growth has been built upon a contradictory dynamic.  Google needs to keep two constituencies satisfied to grow the usage of Google products.  It needs publishers to use schema.org so it has content for its products.  Consumers need that content to be enticing enough to keep using Google products.  But to monetize its products, Google needs to control how schema.org content is acquired from publishers and presented to customers.  Both these behaviors by Google act as disincentives to schema.org usage by publishers and consumers.  

To a large extent, Google has managed this contradiction by making it difficult for various stakeholders to see how it influences their behavior.  Google uses different terminology, different rationales, and even different personas to manage the expectations of stakeholders.  How information about schema.org is communicated does not necessarily match how it works in practice.

Although schema.org still has a low public profile, more stakeholders are starting to ask questions about it. Should they use schema.org structured data at all?  Is how Google uses this structured data unfair?  

To assess the fairness of schema.org involves looking at several inter-related issues: 

  • What schema.org is
  • Who is affected by it 
  • How it benefits or disadvantages different stakeholders in various ways.  

What kind of entity is schema.org?

Before we can assess the value of schema.org to different parties, we need to answer a basic question: what is it, really?  If everyone can agree on what it is we are referring to, it should be easier to see how it benefits various stakeholders.  What seems like a simple question defies a clear, simple answer: there are multiple definitions of schema.org out there, supplied by schema.org, Google, and the W3C.  The schema.org website refers to it as a “collaborative, community activity,” which doesn’t offer much precision.  The most candid answer is that schema.org is a chameleon.  It changes its color depending on its context.

Those familiar with the schema.org vocabulary might expect that schema.org’s structured data would provide us with an unambiguous answer to that question. A core principle of the schema.org vocabulary is to indicate an entity type.  The structured data would reveal the type of entity we are referring to. Paradoxically, the schema.org website doesn’t use structured data.  While it talks about schema.org, it never reveals through metadata what type of entity it is.
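
Had the website used its own vocabulary, declaring its type would take only a few lines of JSON-LD.  Here is a hypothetical sketch of what such self-describing markup could look like — the unanswered question being what the @type value should actually be:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "name": "Schema.org",
      "url": "https://schema.org"
    }
    </script>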

We are forced to ascertain the meaning of schema.org by reading its texts.  When looking at the various ways it’s discussed, we can see that schema.org embodies four distinct identities.  It can be a:

  1. Brand
  2. Website
  3. Organization
  4. Standard

Who is affected by schema.org?

A second basic question: Who are the stakeholders affected by schema.org?  This includes not only who schema.org is supposed to be for, but also who gets overlooked or disempowered.  We can break these stakeholders into segments:

  • Google (the biggest contributor to schema.org and the biggest consumer of data described with its metadata)
  • Google’s search engine competitors who are partners (“sponsors”) in the schema initiative (Microsoft, Yahoo/Verizon, and Russia’s Yandex)
  • Firms that develop IT products or services other than search engines (consumer apps, data management tools) that could be competitive with search engines
  • Publishers of web content, which includes
    •  Commercial publishers who rely on search engines for revenues and in some cases may be competitors to Google products
    •  Non-commercial publishers (universities, non-profits, religious organizations, etc.)
  • Consumers and the wider public that encounter and rely on schema.org-described data
  • Professional service providers that advise others on using schema.org such as SEO consultants and marketing agencies
  • The W3C, which has lent its reputation and hosting to schema.org

By looking at the different dimensions of schema.org and the different stakeholders, we can consider how each interacts.  

Schema.org is very much a Google project — over which it exerts considerable control.  But it cannot appear to be doing so, and therefore relies on various ways of distancing itself from appearing to mandate decisions.  

Schema.org as a brand

Schema.org may not seem like an obvious brand. There’s no logo.  The name, while reassuringly authoritative-sounding, does not appear to be trademarked.  Even how it is spelled is poorly managed.  It is unclear if it is meant to be lower case or uppercase — both are used in the documentation, in some cases within the same paragraph (I mostly use lowercase, following examples from the earliest discussions of the standard.)

Brands are about the identity of people, products, or organizations.  A brand foremost is a story that generates an impression of the desirability and trustworthiness of a tangible thing.  The value of a brand is to attract favorable awareness and interest.   

Like many brands, schema.org has a myth about its founding, involving a journey of its heroes.

Schema.org’s mythic story involves three acts: life before schema.org, the creation of schema.org, and the world after schema.org.

Life before schema.org is presented as chaotic.  Multiple semantic web standards were competing with one another.  Different search engines prioritized different standards.  Publishers such as Best Buy made choices on their own about which standard to adopt.  But many publishers were waiting on the sidelines, wondering what the benefits would be for them.

The schema.org founders present this period as confusing and grim.  But an alternate interpretation is that this early period of commercializing semantic web metadata was full of fresh ideas and possibilities.  Publishers rationally asked how they would benefit eventually, but there’s little to suggest they feared they would never benefit.  In short, the search engines were the ones complaining about having to deal with competition and diversity in standards. Complaints by publishers were few.

With the formation of schema.org, the search engines announced a new standard to end the existence of other general-coverage semantic web metadata standards.  This standard would vanquish the others and end the confusion about which one to follow. Schema.org subsumed or weeded out competing standards. With the major search engines no longer competing with one another and agreeing to a common standard, publishers would be clear about expectations relating to what they were supposed to do.  The consolidation of semantic web standards into one is presented as inevitable.  This outcome is rationalized with the “TINA” justification: there is no alternative.  And there was no alternative for publishers, once the search engines collectively seized control of the semantic metadata standards process.

After schema.org consolidated the semantic web metadata universe, everyone benefits, in this narrative.  The use of semantic metadata has expanded dramatically.  The coverage of schema.org has become more detailed over time.  These metrics demonstrate its success.  Moreover, the project has become a movement in which many people can now participate.  Schema.org positions itself as a force of enlightenment rising above the petty partisan squabbles that bedeviled other vocabularies in the past.  A semi-official history of schema.org states: “It would also be unrecognizable without the contributions made by members of the wider community who have come together via W3C.”

The schema.org brand story never questions other possibilities.  It assumes that competition was bad, rather than seeing it as representing a diversity of viewpoints that might have shaped things differently.  It assumes that the semantic web would never have managed to become commercialized, instead of recognizing the alternative commercial possibilities that might have emerged from the activity and interest of other parties.

Any retrospective judgment that the commercialization of the semantic web would have failed to happen without schema.org consolidating things under the direction of search engines is speculative history.  It’s possible that multiple vocabularies could have existed side-by-side and could have been understood.  Humans speak many languages.  There’s no inherent reason why machines can’t as well.  Language diversity fosters expressive diversity.

Schema.org as a website

Schema.org is a rare entity whose name is also a web address. 

If you want to visit schema.org, you head to the website.  There’s no schema.org convention or schema.org headquarters people can visit. If it isn’t clear who runs schema.org or how it works, at least the website provides palpable evidence that schema.org exists.   Even if it’s just a URL, it provides an address and promises answers.

At times, schema.org emphasizes that is just a website — and no more than that: “Schema.org is not a formal standards body. Schema.org is simply a site where we document the schemas that several major search engines will support.”

In its domain-level naming, schema.org is a “dot-org,” a TLD that Wikipedia notes “is commonly used by schools, open-source projects, and communities, but also by some for-profit entities.”  Schema.org shares a TLD with good-samaritan organizations such as the Red Cross and the World Wildlife Fund.  On first impression, schema.org appears to be a nonprofit charity of some sort.

While the majority of schema.org’s documentation appears on its website, it sometimes has used the “WebSchemas” wiki on the W3C’s domain: https://www.w3.org/wiki/WebSchemas . The W3C is well regarded for its work as a nonprofit organization.  The not-for-profit image of the W3C’s hosting lends a halo of trust to the project.  

In reality, the website is owned by Google.  All the content on the schema.org website is subject to the approval of Google employees involved with the schema.org project.  Google also provides the internal search engine for the site, the Google Programmable Search Engine.

[Screenshot: the “Who is” listing for schema.org]

Schema.org as an organization

Despite schema.org’s disavowal of being a standards body, it does in fact create standards and needs an organizational structure to allow that to happen.  

Schema.org’s formal organization involves two tiers:

  1. A steering group of the four sponsoring companies 
  2. A W3C community group

Once again, the appearances and realities of these arrangements can be quite different.

The steering group

While the W3C community group gets the most attention, one needs to understand the steering group first.  The steering group predates the community group and oversees it.  “The day to day operations of Schema.org, including decisions regarding the schema, are handled by a steering group” notes a FAQ.  The ultimate decision-making authority for schema.org rests with this steering group.

The steering group was formed at the start of the schema.org initiative.  According to steering group members writing in the ACM professional journal, “in the first year, these sponsor companies made most decisions behind closed doors. It incrementally opened up…”  

There are conflicting accounts about who can participate in the steering group.  The 2015 ACM article talks about “a steering committee [sic] that includes members from the sponsor companies, academia, and the W3C.”   The schema.org website focuses on search engines as the stakeholders who steer the initiative: “Schema.org is a collaboration between Google, Microsoft, Yahoo! and Yandex – large search engines who will use this marked-up data from web pages. Other sites – not necessarily search engines – might later join.”  A schema.org FAQ asks: “Can other websites join schema.org as partners and help decide what new schemas to support?” and the answer points to the steering committee governing this.  “The regular Steering Group participants from the search engines” oversee the project.  There have been at least two invited outside experts who have participated as non-regular participants, but the current involvement by outside participants in the steering group is not clear.

Schema.org projects the impression that it is a partnership of equals in the search field, but the reality belies that image. Even though the four search engines describe the steering group as a “collaboration,” the participation by sponsors seems unbalanced. With a 90% market share, Google’s dominance of search is overwhelming, and they have a far bigger interest in the outcomes than the other sponsors.  Since schema.org was formed nearly a decade ago, Microsoft has shifted its focus away from consumer products: dropping smartphones and discontinuing its Cortana voice search — both products that would have used schema.org.  Yahoo has ceased being an independent company and has been absorbed by Verizon, which is not focused on search.  Without having access to the original legal agreement between the sponsors, it’s unclear why either of these companies continues to be involved in schema.org from a business perspective.

The steering group is chaired by a Google employee: Google Fellow R.V. Guha. “R.V. Guha of Google initiated schema.org and is one of its co-founders. He currently heads the steering group,” notes the schema.org website.  Guha’s Wikipedia entry also describes him as being the creator of schema.org. 

Concrete information on the steering group is sparse.  There’s no information published about who is eligible to join, how schema.org is funded, and what criteria it uses to make decisions about what’s included in the vocabulary.  

What is clear is that the regular steering group participation is limited to established search engines, and that Google has been chair.  Newer search engines such as DuckDuckGo aren’t members.  No publishers are members.  Other firms exploring information retrieval technologies such as knowledge graphs aren’t members either.  

The community group

In contrast to the sparse information about the steering group, there’s much more discussion about the W3C community group, which is described as “the main forum for the project.”  

The community group, unlike the steering group, has open membership.  It operates under the umbrella of the W3C, “the main international standards organization for the World Wide Web,” in the words of Wikipedia.  Google Vice President and Chief Internet Evangelist Vint Cerf, referred to as a “father” of the internet, brokered the ties between the two: “Vint Cerf helped establish the relations between Schema.org and the W3C.”  If schema.org does not wish to be viewed as a standard, it chose an odd partner in the W3C.

The W3C’s expectations for community groups are inspiring: “Community Groups enable anyone to socialize their ideas for the Web at the W3C for possible future standardization.”  In the W3C’s vision, anyone can influence standards.

[Screenshot: W3C’s explanation of community groups]

The sponsors also promote the notion that the community group is open, saying the group “make[s] it easy for publishers/developers to participate.” (ACM)  

The vague word “participation” appears multiple times in schema.org literature:  “In addition to people from the founding companies (Google, Microsoft, Yahoo and Yandex), there is substantial participation by the larger Web community.” The suggestion implied is that everyone is a participant with equal ability to contribute and decide.  

While the community is open for anyone to join, that doesn’t mean everyone is equal in decision-making in schema.org’s case — notwithstanding the W3C’s vision.  Everyone can participate, but not everyone can make decisions.

Publishers are inherently disadvantaged in the community process.  Their suggestions are less important than those of the search engines, which are the primary consumers of schema.org structured data.  “As always we place high emphasis on vocabulary that is likely to be consumed, rather than merely published.”

Schema.org as a standard

Schema.org does not refer to itself as a standard, even though in practice it is one.  Instead, schema.org relies on more developer-focused terminology: vocabulary, markup, and data models.  It presents itself as a helpful tool for developers rather than as a set of rules they need to follow.  

Schema.org aims to be monolithic, where no other metadata standard is needed or used. The totalistic name chosen — schema.org — suggests that no other schema is required.  “For adoption, we need a simpler model, where publishers can be sure that a piece of vocabulary is indeed part of Schema.org.”

The search engine sponsors discourage publishers from incorporating other semantic vocabularies together with schema.org.  This means that only certain kinds of entities can be described, and only in certain details.  So while schema.org aims to be monolithic, it can’t describe many of the kinds of details that are discussed in Wikipedia.  The overwhelming focus is on products and services that promote the usage of search engines.  The tight hold prevents other standards from emerging that are outside of the influence of schema.org’s direction.

Schema.org’s operating model is to absorb any competing standard that gains popularity.  “We strongly encourage schema developers to develop and evangelize their schemas. As these gain traction, we will incorporate them into schema.org.”  In doing this, schema.org avoids the burden of developing, on its own, coverage of large domains that involve fine details and require domain-specific expertise.  Schema.org gets public recognition for offering that coverage whenever it decides doing so would benefit its sponsors.  Schema.org has absorbed domain-specific vocabularies relating to autos, finance, and health, which allows search engines to present detailed information relating to these fields.

How Google shapes schema.org adoption

Google exerts enormous power over web publishing.  Many webmasters and SEO specialists devote the majority of their time to satisfying the requirements that Google imposes on publishers and other businesses that need an online presence.

Google shapes the behavior of web publishers and other parties through a combination of carrots and sticks.

Carrots: Google the persuader

Because Google depends on structured data to attract users who will see its ads, it needs to encourage publishers to adopt schema.org.  The task is twofold: 

  1. Encourage more adoption, especially by publishers that may not have had much reason to use schema.org 
  2. Maintain the use of schema.org by existing publishers and keep up interest

Notably, how schema.org describes its benefits to publishers is different from how Google does.  

According to schema.org, the goal is to “make it easier for webmasters to provide us with data so that we may better direct users to their sites.”  The official rationale for schema.org is to help search engines “direct” users to “sites” that aren’t owned by the search engine.   

“When it is easier for webmasters to add markup, and search engines see more of the markup they need, users will end up with better search results and a better experience on the web.”   The official schema.org rationale, then, is that users benefit because they get better results from their search.  Because webmasters are seeking to direct people to come to their site, they will expect that the search results will direct users there.

“Search engines are using on-page markup in a variety of ways. These projects help you to surface your content more clearly or more prominently in search results.”   Again, the implied benefit of using schema.org is about search results — links people can click on to take them to other websites.  

Finally, schema.org dangles a vaguer promise that parties other than search engines may use the data for the benefit of publishers: “since the markup is publicly accessible from your web pages, other organizations may find interesting new ways to make use of it as well.”  The ability of organizations other than search engines to use schema.org metadata is indeed a genuine possibility, though it’s one that hasn’t happened to any great extent.

When Google talks about the benefits, it is far more oblique.  The first appeal is to understanding: make sure Google understands your content.  “If you add schema.org markup to your HTML pages, many companies and products—including Google search—will understand the data on your site. Likewise, if you add schema.org markup to your HTML-formatted email, other email products in addition to GMail might understand the data.”  Google is one of “many” firms in the mix, rather than the dominant one.  Precisely what the payoff is from this understanding is not explicitly stated.

The most tangible incentive that Google dangles to publishers to use schema.org is cosmetic: they gain a prettier display in search.  “Once Google understands your page data more clearly, it can be presented more attractively and in new ways in Google Search.”

Google refers to having a more desirable display as “rich snippets,” among other terms.  It has been promoting this benefit from the start of schema.org: “The first application to use this markup was Google’s Rich Snippets, which switched over to Schema.org vocabulary in 2011.” 
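
What publishers add to earn that cosmetic upgrade is a block of structured data along these lines (a hypothetical product example, with invented values):

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "Example Widget",
      "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.4",
        "reviewCount": "89"
      }
    }
    </script>

The star rating may then appear beneath the listing in search results — a display that Google grants or withholds at its discretion.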

The question is how enticing this carrot is.  

Sticks: Google the enforcer

Google encourages schema.org adoption through more coercive measures as well.  It does so in three ways:

  1. Setting rules that must be followed
  2. Limiting pushback through vague guidelines and uncertainty
  3. Invoking penalties

Even though the schema.org standard is supposedly voluntary and not “normative” about what must be adopted, Google’s implementation is much more directive.  

Google sets rules about how publishers must use schema.org.  Broadly, Google lays down three ultimatums:

  1. Publishers must use schema.org to appear favorably in search
  2. They can only use schema.org and no other standards that would be competitive with it
  3. They must supply data to Google in certain ways  

An example of a Google ultimatum relates to its insistence that only the schema.org vocabulary be used — and no others.  Google has even banned the use of another vocabulary that it once backed: data-vocabulary.org.  Google prefers to consolidate all web metadata descriptions into schema.org, which it actively controls.  “With the increasing usage and popularity of schema.org we decided to focus our development on a single SD [structured data] scheme.”  Publishers who continue to use the now-disfavored vocabulary face a series of threats.  Google here is making a unilateral decision about what kinds of metadata are acceptable to it.

Google imposes a range of threats and penalties for non-compliance.  These tactics are not necessarily specific to schema.org structured data.   Google has used such tactics to promote the use of its “AMP” standard for mobile content.  But these tactics are more significant in the context of the schema.org vocabulary, which is supposed to be a voluntary and public standard.  

Google is opaque about how schema.org could influence rankings.  If it’s used incorrectly, might your ranking be hurt, or your site even disappear from results?

[Screenshot: an article reflecting anxiety about search rankings and schema.org usage]

Google never suggests that schema.org can positively influence search rankings.  But it leaves open the possibility that not using it could negatively influence rankings.  

Google’s threats and penalties relating to schema.org usage can be categorized into four tactics:

1. Warnings — messages in yellow that the metadata aren’t what Google expects

2. Errors — messages in red that Google won’t accept the structured data

3. Being ignored — a threat that the content won’t be prioritized by Google

4. Manual actions — a stern warning that the publisher will be sanctioned by Google

Manual actions are the death sentence that Google applies.  Publishers can appeal to Google to change its decision.  But ultimately Google decides what it wants, and without a reversal of its prior decision, the publisher is ostracized from Google and won’t be found by anyone searching for them.  The publisher becomes persona non grata.

An example of a “manual action” sanction is when a publisher posts a job vacancy but provides “no way to apply for the job” via the schema.org mechanism.  That doesn’t imply there’s no job — it simply means the poster of the job declined Google’s terms: that people must be allowed to apply from Google’s product, without the benefit of additional information that Google doesn’t allow to be included.

While publishers may not like how they are treated, Google makes sure they have no grounds to protest. Google manages publisher expectations around fairness by providing vague guidance and introducing uncertainty.

A typical Google statement: “Providing structured data to enable an enhanced search feature is no guarantee that the page will appear with that designated feature; structured data only enables a feature to be displayed. Google tries to display the most appropriate and engaging results to a user, and it might be the case that the feature you code for is not appropriate for a particular user at a particular time.”  Google is the sole arbiter of fairness.

In summary, Google imposes rules on publishers while making sure those rules impose no obligations on Google.  Its market power allows it to do this.

How Google uses schema.org to consolidate market dominance

While there’s been much discussion by regulators, journalists, economists, and others about Google’s dominance in search, little of this discussion has focused on the role of schema.org in enabling this dominance.  But as we have seen, Google has rigged the standards-making process and pressed publishers to use schema.org under questionable pretenses.  It has been able to leverage the use of structured data based on the schema.org metadata standard to solidify its market position.

Google has used schema.org for web scraping and to build vertical products.  In both cases, it is taking away opportunities from publishers who rely on Google and who in many cases are customers of Google’s ad business.

Web scraping 2.0 and content acquisition

Google uses schema.org to harvest content from publisher websites.  It promotes the benefits to publishers of having this content featured as a “rich snippet” or other kinds of “direct answer” — even if there’s no link to the website of the publisher contributing the information.  

For many years, governments and publishers around the world have accused Google of scraping content without consent.  This legally fraught area has landed Google in political trouble.  A far more desirable option for Google is to scrape web content with implied consent.  Schema.org metadata is a perfect vector for Google to acquire vast quantities of content easily: publishers do the work of providing the data in a format that Google can readily find and process.  And they do this voluntarily, in the belief that it will help them attract more customers to their websites.  In many ways, this is a big con.  There’s growing evidence that publishers are disadvantaged when their content appears as a direct answer.  And it’s troubling to think they are doing so much work for Google to take advantage of them.  Lacking negotiating power with Google, they consent to arrangements that may harm them.

In official financial disclosures filed with the US government, Google has acknowledged that it wants to avoid presenting links to other firms’ websites in its search results.  Instead, it uses schema.org data to provide “direct answers” (such as Rich Snippets) so that people stay on Google products.  “Instead of just showing ten blue links in our search results, we are increasingly able to provide direct answers — even if you’re speaking your question using Voice Search — which makes it quicker, easier and more natural to find what you’re looking for,” Google noted in an SEC filing last year.

The Markup notes how publishers may be penalized when their content appears as a “rich snippet” if the user is looking for external links: “Google further muddied the waters recently when it decided that if a non-Google website link appears in a scraped “featured snippet” module, it would remove that site from traditional results below (if it appeared there) to avoid duplication.” 

One study by noted SEO expert Rand Fishkin found that half of all Google searches result in zero clicks.

Verticals

Google has developed extensive coverage of details relating to specific “verticals” or industry sectors.  These sectors typically have data-rich transactional websites that customers use to buy services, find things, or manage their accounts.  Google has been interested in limiting consumer use of these websites and keeping consumers on Google products.  For example, instead of encouraging people who are searching for a job to visit a job listing website, Google would prefer them to explore job opportunities while staying on Google’s search page — whether or not all the relevant jobs are available without visiting another website.  Google is directly competing with travel booking websites, job listing websites, hotel websites, airline websites, and so on.  Nearly all these sites need to pay Google for ads to appear near the top of the search page, even though Google is working to undermine the ability of users to find and access links to these sites.

According to internal Google emails submitted to the US Congress, Google decided to compete directly with firms in specific industry sectors.  To quote these Google emails:

“What is the real threat if we don’t execute on verticals?”

a) “Loss of traffic from google.com … .

b) Related revenue loss to high spend verticals like travel.

c) Missing oppty if someone else creates the platform to build verticals.

d) If one of our big competitors builds a constellation of high quality verticals, we are hurt badly.”

Examples of verticals where Google has encouraged schema.org to build out detailed metadata include: 

  • Jobs and employers 
  • Books 
  • Movies 
  • Courses 
  • Events 
  • Products 
  • Voice-enabled content 
  • Subscription content 
  • Appointments at local businesses (restaurants, spas, health clubs)

Google employees are recognized as having contributed the key work to the schema.org vocabulary in two areas pertaining to verticals:

  • “Major topics” — coverage of the verticals mentioned previously
  • The “actions vocabulary,” which enables bypassing vertical websites to complete tasks.

The actions vocabulary is an overlooked dimension of Google’s vertical strategy.  Actions let you complete transactions from within Google products without needing to access the websites of the service provider.  (Actions in the schema.org vocabulary are not related to Google’s Manual Actions sanctions discussed earlier.)  The schema.org website notes: “Sam Goto of Google did much of the work behind schema.org’s Actions vocabulary.”  This Google employee, who was responsible for the problem-space of actions, explained on his blog the barriers that consumers (and Google) face in completing actions across websites (see the sketch after this list):

  • “the opentable/urbanspoon/grubhub can’t reserve *any* arbitrary restaurant, they represent specific ones”
  • “Netflix/Amazon/Itunes can’t stream *any* arbitrary movie, there is a specific set of movies available”
  • “Taxis have their own coverage/service area”
  • “AA.com can’t check-in into UA flights”
  • “UPS can’t track USPS packages, etc.”
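
As a rough illustration of what this vocabulary enables, here is a sketch (in Python, emitting JSON-LD) of a hypothetical restaurant advertising a reservation endpoint through schema.org’s ReserveAction.  The name and URL are invented; only the types and properties come from the schema.org vocabulary.

```python
import json

# A sketch of schema.org's Actions vocabulary: a hypothetical
# restaurant declaring that reservations can be made through a given
# endpoint, so a transaction can begin inside another product.
restaurant = {
    "@context": "https://schema.org",
    "@type": "Restaurant",
    "name": "Trattoria Esempio",  # invented name and URL
    "potentialAction": {
        "@type": "ReserveAction",
        "target": {
            "@type": "EntryPoint",
            "urlTemplate": "https://example.com/reserve?partySize={partySize}",
            "actionPlatform": "https://schema.org/DesktopWebPlatform",
        },
    },
}

print(json.dumps(restaurant, indent=2))
```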

These transactions target access to the most economically significant service industries.  While consumers may like a unified way to choose things, they still expect options about the platform they can use to make their choices.  In principle, the actions could have opened up competition.  But given Google’s domination of search and smartphone platforms, the benefits of actions have filtered up to Google, not down to consumers.  Consumers may not see all the relevant options they need; Google has decided what choices are available to them.

There’s a difference between having the option of using a unified access point, and having only one access point.  When a unified access point becomes the only access point, it becomes a monopoly.

Schema.org and the illusion of choice

Google relies on tactics of “self-favoring” — a concept that current antitrust rules don’t regulate effectively.  But a recent analysis noted the need to address the problem: “Google’s business model rests on recoupment against monetization of data blended with the selling of adjunct services. Deprived of access to this, Google might not have offered these services, nor maintained them, or even developed them in the first place.”

Google cares about schema.org because it is profitable to do so — both in the immediate term and in the long term.  Google makes money by directing customer focus to its ad-supported products, and it consolidates its stranglehold on the online market as potential competitors get squeezed by its actions.

But if Google couldn’t exploit its monopoly market position to make unfair profits from schema.org, would it care about schema.org?  That’s the question that needs to be tested.  If it did, it would be willing to fund schema.org with third-party oversight, and allow its competitors and others economically impacted by schema.org to have a voice in schema.org decision-making.  It would allow governments to review how Google uses the data it harvests from schema.org to present choices to customers.  

— Michael Andrews

Categories
Big Content

Time to end Google’s domination of schema.org

Few companies enjoy being the object of public scrutiny.  But Google, one of the world’s most recognized brands, seems especially averse.  Last year, when Sundar Pichai, Google’s chief executive, was asked to testify before Congress about antitrust concerns, he refused.  Congress held the hearing without him.  His name card was there, in front of an empty chair.

Last month Congress held another hearing on antitrust.  This time, Pichai was in his chair in front of the cameras, remotely if reluctantly.   During the hearings, Google’s fairness was a focal issue.  According to a summary of the testimony on the ProMarket blog of the University of Chicago Business School’s Stigler Center: “If the content provider complained about its treatment [by Google], it could be disappeared from search. Pichai didn’t deny the allegation.”

One of the major ways that content providers gain (or lose) visibility on Google — the first option that most people choose to find information — is through their use of a metadata standard known as schema.org.  And the hearing revealed that publishers are alleging that Google engages in bullying tactics relating to how their information is presented on the Google platform.  How might these issues be related?  Who sits in the chair that decides the stakes?

Metadata and antitrust may seem arcane and wonky topics, especially when looked at together.  Each requires some basic knowledge to understand, so it is rare that the interaction between the two is discussed.  Yet it’s never been more important to remove the obscurity surrounding how the most widely used standard for web metadata, schema.org, influences the fortunes of Google, one of the most valuable companies in the world. 

Why schema.org is important to Google’s dominant market position

Google controls 90% of searches in the US, and its Android operating system powers 9 of 10 smartphones globally.  Both these products depend on schema.org metadata (or “structured data”) to induce consumers to use Google products.  

A recent investigation by The Markup noted that “Google devoted 41 percent of the first page of search results on mobile devices to its own properties and what it calls ‘direct answers.’”  Many of these direct answers are populated by schema.org metadata that publishers provide in hopes of driving traffic to their websites.  But Google has a financial incentive to stop traffic from leaving its websites.  The Markup notes that “Google makes five times as much revenue through advertising on its own properties as it does selling ad space on third-party websites.”  Tens of billions of dollars of Google revenues depend in some way on the schema.org metadata.  In addition to web search results, many Google smartphone apps including Gmail capture schema.org metadata that can support other ad-related revenues.  
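
To illustrate that last point, here is a sketch (in Python, emitting JSON-LD) of the kind of schema.org markup that can be embedded in an email for Gmail to read.  The reservation details are invented; the type and property names come from the public schema.org vocabulary.

```python
import json

# A sketch of schema.org markup of the kind that can be embedded in
# an email for Gmail to read. The reservation details are invented;
# the type and property names come from the schema.org vocabulary.
email_markup = {
    "@context": "https://schema.org",
    "@type": "EventReservation",
    "reservationNumber": "ABC123",  # hypothetical
    "underName": {"@type": "Person", "name": "Jane Doe"},
    "reservationFor": {
        "@type": "Event",
        "name": "Concert at Example Hall",
        "startDate": "2020-09-01T20:00:00",
    },
}

print(json.dumps(email_markup, indent=2))
```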

During the recent antitrust hearings, Congressman Cicilline told Sundar Pichai that “Google evolved from a turnstile to the rest of the web to a walled garden.”  The walled garden problem is at the core of Google’s monopoly position.  Perversely, Google has been able to manipulate the use of public standards to create a walled garden for its products.  Google reaps a disproportionate benefit from the standard by preventing broader uses of it that could pose competitive threats to Google.

There’s a deep irony here: a W3C-sanctioned metadata standard, schema.org, has been captured by a tech giant not just to promote its own interests but to limit the interests of others.  Schema.org was supposed to popularize the semantic web and help citizens gain unprecedented access to the world’s information.  Yet Google has managed to monopolize this public asset. 

How schema.org became a walled garden

Understanding how Google came to dominate a W3C-affiliated standard requires a little history.  The short version is that Google has always been schema.org’s chief patron.  It created schema.org and promoted it in the W3C.  Since then, it has consolidated its hold on it. 

The semantic web — the inspiration for schema.org — has deep roots in the W3C.  Tim Berners-Lee, the founder of the World Wide Web, coined the concept and has been its major champion.  The commercialization of the approach has been long in the making.  Metaweb was the first venture-funded company to commercialize the semantic web with its product Freebase.  The New York Times noted at the time: “In its ambitions, Freebase has some similarities to Google — which has asserted that its mission is to organize the world’s information and make it universally accessible and useful. But its approach sets it apart.”  Google bought Metaweb and its Freebase database in 2010, removing a potential competitor.  The following year (2011) Google launched the schema.org initiative, bringing along Bing and Yahoo, the other search engines that competed with Google.  While the market shares of Bing and Yahoo were small compared to Google’s, the launch raised hopes that more options would be available for search.  Google noted: “With schema.org, site owners can improve how their sites appear in search results not only on Google, but on Bing, Yahoo! and potentially other search engines as well in the future.”  Nearly a decade later, there is even less competition in search than there was when schema.org was created.

In 2015 a Google employee proposed that schema.org become a W3C community group.  He soon became the chair of the group once it was formed.  

By making schema.org a W3C community group, the Google-driven initiative gained credibility through its W3C endorsement as a community-driven standard.  Previously, only Google and its initiative partners (Microsoft’s Bing, Yahoo, and later Russia’s Yandex) had any say over the decisions that webmasters and others involved with publishing web content needed to follow, a situation that could have triggered antitrust alarms relating to collusion.  Google also faced the challenge of encouraging webmasters to adopt the schema.org standard.  Webmasters had been slow to embrace the standard and assume the work involved with using it.  Making schema.org an open, community-driven standard solved multiple problems for Google at once.  

In normal circumstances — untinged by a massive and domineering tech platform — an open standard should have encouraged webmasters to participate in the standards-making process and express their goals and needs. Ideally, a community-driven standard would be the driver of innovation. It could finally open up the semantic web for the benefit of web users.  But the tight control Google has exercised over the schema.org community has prevented that from happening.

The murky ownership of the schema.org standard

From the beginning of schema.org, Google’s participation has been more active than anyone else’s, and Google’s guidance about schema.org has been more detailed than even the official schema.org website’s.  This has created a great deal of confusion among webmasters about what schema.org requires for compliance with the standard, as opposed to what Google requires for compliance for its search results and ranking.  It’s common for an SEO specialist to ask a question about Google’s search results in a schema.org forum.  Even people with a limited knowledge of schema.org’s mandate assume — correctly — that it exists primarily for the benefit of Google.  

In theory, Google is just one of numerous organizations that implement a standard created by a third party.  In practice, Google is both the biggest user of the schema.org standard — and also its primary author.  Google is overwhelmingly the biggest consumer of schema.org structured data.  It is also by far the most active contributor to the standard.  Most other participants are along for the ride: trying to keep up with what Google is deciding internally about how it will use schema.org in its products, and what it is announcing externally about changes Google wants to make to the standard.

In many cases, if you want to understand the schema.org standard, you need to rely on Google’s documentation.  Webmasters routinely complain about the quality of schema.org’s documentation: its ambiguities, or the lack of examples.  Parts of the standard that are not priorities for Google are not well documented anywhere.  If they are priorities for Google, however, Google itself provides excellent documentation about how information should be specified in schema.org so that Google can use it.   Because schema.org’s documentation is poor, the focus of attention stays on Google.

The reliance that nearly everyone has on Google to ascertain compliance with schema.org requirements was highlighted last month by Google’s decision to discontinue its Structured Data Testing Tool, which is widely used by webmasters to check that their schema.org metadata is correct — at least as far as Google is concerned.  Because the concrete implementation requirements of schema.org are often murky, many rely on this Google tool to verify the correctness of the data independently of how the data would be used.  Google is replacing this developer-focused tool with a website that checks whether the metadata will display correctly in Google’s “rich results.”  The new “Rich Results Test Tool” finally acknowledges what’s been an open secret: Google’s promotion of schema.org is primarily about populating its walled garden with content.  

Google’s domination of the schema.org community

The purpose of a W3C group should be to serve everyone, not just a single company. In the case of schema.org, a W3C community has been dominated from the start by a single company: Google.

Google has chaired the schema.org community continuously since its inception in 2015.  Microsoft (Bing) and Yahoo (now Verizon), which are minor players in the search business, participate nominally but are not very active considering they were founding members of schema.org.  Google, in contrast, has multiple employees active in community discussions, steering the direction of conversations.  These employees shape the core decisions, together with a few independent consultants who have longstanding relationships with Google.  It’s hard to imagine any decision happening without Google’s consent.  Google has effective veto power over decisions.

Google’s domination of the schema.org community is possible because the community has no resources of its own.  Google conveniently volunteers the labor of its employees to perform duties related to community business, but these activities naturally reflect the interests of their employer, Google.  Other firms don’t have the financial stake in schema.org’s decisions that Google has through its market dominance of search and smartphones, so they don’t allocate employees to spend time on schema.org issues.  Google corners the discussion while appearing to be the most generous contributor.

The absence of governance in the schema.org community

The schema.org community essentially has zero governance — a situation Google is happy with.  There are no formal rules, no formal process for proposals and decisions, no way to appeal a decision, and no formal roles apart from the chair, who ultimately can decide everything. There’s no process of recusal.  Google holds sway in part because the community has no permanent and independent staff.  And there’s no independent board of oversight reviewing how business is conducted.

It’s tempting to see the absence of governance as an example of a group of developers who have a disdain for bureaucracy — that’s the view Google encourages.  But the commercial and social significance of these community decisions is enormous and shouldn’t be cloaked in capricious informality.  Moreover, the more mundane problems of a lack of process are also apparent.  Many people who attempt to make suggestions feel frozen out and unwelcome.  Suggestions may be challenged by core insiders who have deep relationships with one another.  The standards-making process itself lacks standardization.  

In the absence of governance, the possibilities for conflicts of interest are substantial.  First, there’s the problem of self-dealing: Google using its position as the chair of a public forum to prioritize its own commercial interests ahead of others’.  Second, there’s the possibility that non-Google proposals will be stopped because they are seen as costly to Google, if only because they create extra work for the largest single user of schema.org structured data.  

As a public company, Google is obligated to its shareholders — not to larger community interests.  A salaried Google employee can’t simultaneously promote his company’s commercial interests and promote interests that could weaken his company’s competitive position.  

Community bias in decisions

Few people want an open W3C community to exhibit biases in its decisions.  But owing to Google’s outsized participation and the absence of governance, decision making that’s biased toward Google’s priorities is common.

Whatever Google wants is fast-tracked — sometimes happening within a matter of days.  If a change to schema.org is needed to support a Google product that needs to ship, nothing will slow it down.

Suggestions from people not affiliated with Google face a tougher journey.  If a suggestion does not match Google’s priorities, it is slow-walked.  It will be challenged as to its necessity or practicality.  It will languish as an open issue on GitHub, where it will go unnoticed unless it generates an active discussion.  Eventually, the chair will cull proposals that have been long buried, in the interest of closing out open issues.

While individuals and groups can propose suggestions of their own, successful ones tend to be incremental in nature, already aligned with Google’s agenda.  More disruptive or innovative ones are less likely to be adopted.

In the absence of a defined process, the ratification of proposals tends to happen through an informal virtual acclamation.  Various Google employees will conduct a public online discussion agreeing with one another on the merits of adopting a proposal or change.  With “community” sentiment demonstrated, the change is pushed ahead.  

Consumer harm from Google’s capture of schema.org

Google’s domination of schema.org is an essential part of its business model.  Schema.org structured data drives traffic to Google properties, and Google has leveraged it so that it can present fewer links that would drive traffic elsewhere.  The more time consumers spend on Google properties, the more their information decisions are limited to the ads that Google sells.  Consumers need to work harder to find “organic” links (objectively determined by their query and involving no payment to Google) to information sources they seek.

A technical standard should be a public good that benefits all.  In principle, publishers that use schema.org metadata should be able to expand the reach of their information, so that apps from many firms take advantage of it, and consumers have more choices about how and where they get their information.  The motivating idea behind semantic structured data, such as schema.org provides, is that information becomes independent of platforms.  But ironically, for consumers to enjoy the value of structured data, they mostly need to use Google products.  This is a significant market failure, and it hasn’t happened by accident.

The original premise of the semantic web was based on openness.  Publishers freely offered information, and consumers could freely access it.  But the commercial version, driven by Google, has changed this dynamic.  The commercial semantic web isn’t truly open; it is asymmetrically open.  It involves open publishing but closed access.  Web publishers are free to publish their data using the schema.org standard and are actively encouraged to do so by Google. The barriers to creating structured data are minimal, though the barriers to retrieving it aren’t.  

Right now, only a firm with the scale of Google is in a position to access this data and normalize it into something useful for consumers.  Google’s formidable ad revenues allow it to crawl the web and harvest the data for its private gain.  A few other firms are also harvesting this data to build private knowledge graphs that similarly provide gated access.  The goal of open consumer access to this data remains elusive.  A small company may invest time or money to create structured data, but it lacks the means to use structured data for its own purposes.  But it doesn’t have to be this way.   
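
To see where the asymmetry lies, consider this sketch in Python of retrieving schema.org data from a single page.  Per-page extraction is trivial, as the sketch shows; the barrier for a small firm is crawling and normalizing billions of such pages.  The URL is a placeholder.

```python
import json
import urllib.request
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> tags."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            self.blocks.append(data)

url = "https://example.com/"  # placeholder; any page carrying JSON-LD markup
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")

parser = JSONLDExtractor()
parser.feed(html)
for block in parser.blocks:
    try:
        print(json.dumps(json.loads(block), indent=2))
    except json.JSONDecodeError:
        pass  # publishers sometimes ship malformed JSON-LD
```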

Has Google’s domination of schema.org stifled innovation?

When considering how big tech has influenced innovation, it is necessary to pose a counterfactual question: What might have been possible if the heavy hand of a big tech platform hadn’t been meddling?

Google’s routine challenge to suggestions for additions to the schema.org vocabulary is to question whether the new idea will be used.  “What consuming application is going to use this?” is the common screening question.  If Google isn’t interested in using it, why is it worth doing?  Unless the individual making the suggestion is associated with a huge organization that will build significant infrastructure around the new proposal, the proposal is considered unviable.  

The phrase “consuming applications” is an example of how Google avoids referring to itself and its market dominance.  The Markup recently revealed how Google coaches its employees to avoid phrases that could get it into additional antitrust trouble.  Within the schema.org community group, Google employees strive to make discussion appear objective, as if Google had no stake in the decision.  

One area where Google has discouraged alternative development is the linking of schema.org data with data described by other metadata vocabularies (standards).  This is significant for multiple reasons.  The schema.org vocabulary is limited in scope, mostly covering commercial entities rather than non-commercial ones.  Because Google is not interested in covering non-commercial entities, publishers need to rely on other vocabularies.  But Google doesn’t want to look at other vocabularies, claiming that it is too taxing to crawl data described by them.  In this, Google is making a commercial decision that goes against the principles of linked data (a cornerstone of the semantic web), which explicitly encourage the mixing of vocabularies.  Publishers are forced to obey Google’s diktats: why supply metadata that Google, the biggest consumer of schema.org metadata, says it will ignore?  With a few select exceptions, Google mandates that only schema.org metadata be used in web content and no other semantic vocabularies.  Google sets the vision of what schema.org is and what it does.
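
For illustration, here is a sketch (in Python, emitting JSON-LD) of the linked-data practice at issue: a single block mixing schema.org with a second vocabulary, in this case Dublin Core terms.  The property value is invented.

```python
import json

# A sketch of mixing vocabularies in one JSON-LD block: schema.org as
# the default vocabulary, plus a Dublin Core term for a property that
# schema.org lacks. The "dc" prefix maps to the Dublin Core namespace.
mixed = {
    "@context": {
        "@vocab": "https://schema.org/",
        "dc": "http://purl.org/dc/terms/",
    },
    "@type": "Article",
    "headline": "Mixing vocabularies",
    "dc:provenance": "Digitized from a print original",  # not in schema.org
}

print(json.dumps(mixed, indent=2))
```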

To break this cycle, the public should be asking: How might consumers access and utilize information from the information commons without relying on Google?

There are several possible paths.  One might involve opening up the web crawl to wider use by firms of all sizes.  Another would be to expand the role of the schema.org vocabularies in APIs that support consumer apps.  Whatever path is pursued, it needs to be attractive to small firms and startups, to bring greater diversity to consumers and spark innovation.

Possibilities for reform: Getting Google out of the way

If schema.org is to continue as a W3C community, and to remain associated with the trust conferred by that designation, it will require serious reform.  It needs governance — and independence from Google.  It may need to transform into something far more formal than a community group.  

In its current incarnation, it’s difficult to imagine this level of oversight.  The community is resource-starved, and relies on Google to function. But if schema.org isn’t viable without Google’s outsized involvement, then why does it exist at all?  Whose community is it?

There’s no rationale to justify the W3C lending its endorsement to a community that is dominated by a single company.  One solution is for schema.org to cease being part of the W3C umbrella and return to its prior status as a Google-sponsored initiative.  That would be the honest solution, barring more sweeping changes.

Another option would be to create a rival W3C standard that isn’t tied to Google and therefore couldn’t be dominated by it, but a standard Google couldn’t afford to ignore.  That would be a more radical option, involving significant reprioritization by publishers.  It would be disruptive in the short term, but might ultimately result in greater innovation.  A starting point for this option would be to explore how to popularize Wikidata as a general-purpose vocabulary that could be used instead of schema.org.
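
As a hint of what that starting point might look like, here is a sketch (in Python, emitting JSON-LD) of describing an entity with a Wikidata identifier alongside schema.org types.  The example is illustrative only.

```python
import json

# A sketch of leaning on Wikidata as a general-purpose vocabulary:
# the entity carries a Wikidata item URI alongside schema.org markup,
# so consuming applications need not depend on schema.org alone.
# Q80 is Wikidata's item for Tim Berners-Lee.
entity = {
    "@context": {
        "@vocab": "https://schema.org/",
        "wd": "http://www.wikidata.org/entity/",
    },
    "@type": "Person",
    "name": "Tim Berners-Lee",
    "sameAs": "http://www.wikidata.org/entity/Q80",
}

print(json.dumps(entity, indent=2))
```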

A final option would be for Google to step up in order to step down.  It could acknowledge that it has benefited enormously from the thousands of webmasters and others who contribute structured data, and that it owes a debt to them.  It could offer to pay back in kind.  Google could draw on the recent example of Facebook’s funding of an independent body that will provide oversight of that company.  Google could fund a truly independent body to oversee schema.org, and financially guarantee the creation of a new organizational structure.  Such an organization would leave no questions about how decisions are made and would dispel industry concerns that Google is gaining unfair advantages.  Given the heightening regulatory scrutiny of Google, this option is not as extravagant as it may first sound.

On a pragmatic level, I would like to see schema.org realize its full potential.  This issue is important enough to merit broader discussion, not just in the narrow community of people who work on web metadata, but among those involved with regulating technology and looking at antitrust.  Google spends considerable sums, often furtively, hiring academic experts and others to dismiss concerns about its market dominance.  The role of metadata should be to make information more transparent.  That’s why this matters in so many ways.

— Michael Andrews

Clarification: schema.org’s status as a north star and as a standard (August 12)

The welcome page of schema.org notes it is “developed by an open community process, using the public-schemaorg@w3.org mailing list.”   When I first published this post I referred to schema.org as an “open W3C metadata standard.” Dan Brickley of Google tweeted to me and others stating that I made a “simple factual error” doing so. He is technically correct that my characterization of the W3C’s role is not precise, so I have changed the wording to say “a W3C-sanctioned metadata standard” instead (sanctioned = permitted), which is the most accurate I can manage, given the intentionally confusing nature of schema.org’s mandate.  This may seem like mincing words, but the implications are important, and I want to elaborate on what those are.

It is true that schema.org is not an official W3C standard in the sense that HTML5 is, which had a cast of thousands involved in its development.  For a standard to become an official W3C standard, it needs to go through a long process of community vetting, moving through stages such as first becoming a recommendation.  Even a recommendation is not yet an official standard, though it is widely followed.  Just because technical guidelines aren’t official W3C standards, or aren’t even referred to as standards, does not mean they don’t have the effect of a standard that others are expected to follow in order to gain market acceptance.  Standards vary in the degree to which they are voluntary — schema.org has always been a voluntary standard.  And there are different levels of standards maturity within the W3C’s standards-making framework, with the most mature ones reflecting the most stringent levels of compliance.  A W3C community group’s discussions around standards proposals would be the least rigorous, and are normally associated with the least developed stage of standards activity.  A community group is typically associated with new ideas for standards, rather than well-formed standards that are widely used by thousands of companies.  

A key difference with the schema.org community group is that it hosts discussions about a fully-formed standard.  This standard was fully formed before there was ever a community group to discuss it.  In other words, there was never any community input on the basic foundation of schema.org.  Google decided this together with its partners in the schema.org initiative.  

So I agree that schema.org fails to satisfy the expectations of a W3C standard.  The W3C has a well-established process for standards, and schema.org’s governance doesn’t remotely align with how a W3C standard is developed.  

The problem is that by having a fully-formed standard discussed in a W3C forum, it appears as if schema.org is a W3C standard of some sort.  Appearances do matter.  Webmasters on the W3C mailing list can reasonably assume the W3C endorses schema.org.  And by hosting a community group on schema.org, the W3C has lent support to schema.org.  To outsiders, the W3C appears to be sponsoring its development and, one would presume, to be interested in having open participation in decision making about it.  The terms of service for schema.org treat “the schemas published by Schema.org as if the schemas were W3C Recommendations.”  The optics of schema.org imply it is W3C-ish.  

Dan Brickley refers to schema.org as an “independent project” and not a “W3C thing.”  I’m not reassured by that characterization, which is the first time I’ve heard Google draw explicit distance from W3C affiliation.  He seems to be rejecting the notion that the W3C should provide any oversight over the schema.org process.  The W3C is merely providing a free mailing list.  The four corporate “sponsors” of schema.org set the binding conditions of the terms of service.  Nothing schema.org is working on is intended to become an official W3C standard and hence subject to W3C governance.  

Even though 10 million websites use schema.org metadata and are affected by its decisions, schema.org’s decision making is tightly held.   Ultimate decision making authority rests with a Steering Committee (also chaired by Google) that is invitation-only and not open to public participation.  Supposedly, a W3C representative is allowed to sit on this committee, though the details about this, like much else in schema.org’s documentation, are unclear.   

It may seem reassuring to imagine that schema.org belongs to a nebulous entity called the “community,” but that glosses over how much of the community’s activity and decision making is Google-driven.  Google does draw on the expertise and ideas of others, so schema.org is more than one company’s creation.  But in the end, Google keeps tight control over the process so that schema.org reflects its priorities.  It would be simpler to call it the Google Structured Data schema.  

Schema.org appears to be public and open, while in practice it is controlled by a small group of competitors, and one firm in particular.  Google is having its cake and eating it too.  If schema.org does not want W3C oversight, then the W3C should disavow having a connection with it, and help to reduce at least some of the confusion about who is in control of schema.org.