
Time to end Google’s domination of schema.org

Few companies enjoy being the object of public scrutiny.  But Google, one of the world’s most recognized brands, seems especially averse.  Last year, when Sundar Pichai, Google’s chief executive, was asked to testify before Congress about antitrust concerns, he refused.  They held the hearing without his presence.  His name card was there, in front of an empty chair.

Last month Congress held another hearing on antitrust.  This time, Pichai was in his chair in front of the cameras, remotely if reluctantly.   During the hearings, Google’s fairness was a focal issue.  According to a summary of the testimony on the ProMarket blog of the University of Chicago Business School’s Stigler Center: “If the content provider complained about its treatment [by Google], it could be disappeared from search. Pichai didn’t deny the allegation.”

One of the major ways that content providers gain (or lose) visibility on Google — the first option that most people choose to find information — is through their use of a metadata standard known as schema.org.  And the hearing revealed that publishers are alleging that Google engages in bullying tactics relating to how their information is presented on the Google platform.  How might these issues be related?  Who sits in the chair that decides the stakes?

Metadata and antitrust may seem arcane and wonky topics, especially when looked at together. Each requires some basic knowledge to understand, so it is rare that the interaction between the two is discussed. Yet it’s never been more important to remove the obscurity surrounding how the most widely used standard for web metadata, schema.org, influences the fortunes of Google, one of the most valuable companies in the world.

Why schema.org is important to Google’s dominant market position

Google controls 90% of searches in the US, and its Android operating system powers 9 of 10 smartphones globally.  Both these products depend on schema.org metadata (or “structured data”) to induce consumers to use Google products.  

A recent investigation by The Markup noted that “Google devoted 41 percent of the first page of search results on mobile devices to its own properties and what it calls ‘direct answers.’”  Many of these direct answers are populated by schema.org metadata that publishers provide in hopes of driving traffic to their websites.  But Google has a financial incentive to stop traffic from leaving its websites.  The Markup notes that “Google makes five times as much revenue through advertising on its own properties as it does selling ad space on third-party websites.”  Tens of billions of dollars of Google revenues depend in some way on the schema.org metadata.  In addition to web search results, many Google smartphone apps including Gmail capture schema.org metadata that can support other ad-related revenues.  

During the recent antitrust hearings, Congressman Cicilline told Sundar Pichai that “Google evolved from a turnstile to the rest of the web to a walled garden.”  The walled garden problem is at the core of Google’s monopoly position.  Perversely, Google has been able to manipulate the use of public standards to create a walled garden for its products.  Google reaps a disproportionate benefit from the schema.org standard by preventing broader uses of it that could result in competitive threats to Google.

The deep irony is that a W3C-sanctioned metadata standard, schema.org, has been captured by a tech giant not just to promote its unique interests but to limit the interests of others.  Schema.org was supposed to popularize the semantic web and help citizens gain unprecedented access to the world’s information. Yet Google has managed to monopolize this public asset.

How schema.org became a walled garden

How Google came to dominate a W3C-affiliated standard requires a little history.  The short history is that Google has always been schema.org’s chief patron.  It created schema.org and promoted it in the W3C.  Since then, it has consolidated its hold on it. 

The semantic web, the inspiration for schema.org, has deep roots in the W3C.  Tim Berners-Lee, the inventor of the World Wide Web, coined the concept and has been its major champion.  The commercialization of the approach has been long in the making. Metaweb was the first venture-funded company to commercialize the semantic web with its product Freebase.  The New York Times noted at the time: “In its ambitions, Freebase has some similarities to Google — which has asserted that its mission is to organize the world’s information and make it universally accessible and useful. But its approach sets it apart.”  Google bought Metaweb and its Freebase database in 2010, thereby removing a potential competitor.  The following year (2011) Google launched the schema.org initiative, bringing along Bing and Yahoo, the other search engines that competed with Google.  While the market share of Bing and Yahoo was small compared to Google’s, the launch initiative raised hopes that more options would be available for search.  Google noted: “With schema.org, site owners can improve how their sites appear in search results not only on Google, but on Bing, Yahoo! and potentially other search engines as well in the future.”  Nearly a decade later, there is even less competition in search than there was when schema.org was created.

In 2015 a Google employee proposed that schema.org become a W3C community group.  He soon became the chair of the group once it was formed.  

By making schema.org a W3C community, the Google-driven initiative gained credibility through its W3C endorsement as a community-driven standard. Previously, only Google and its initiative partners (Microsoft’s Bing, Yahoo, and later Russia’s Yandex) had any say over the decisions that webmasters and other individuals involved with publishing web content needed to follow, a situation which could have triggered antitrust alarms relating to collusion.   Google also faced the challenge of encouraging webmasters to adopt the schema.org standard.  Webmasters had been slow to embrace the standard and assume the work involved with using it.  Making schema.org an open community-driven standard solved multiple problems for Google at once.  

In normal circumstances — untinged by a massive and domineering tech platform — an open standard should have encouraged webmasters to participate in the standards-making process and express their goals and needs. Ideally, a community-driven standard would be the driver of innovation. It could finally open up the semantic web for the benefit of web users.  But the tight control Google has exercised over the schema.org community has prevented that from happening.

The murky ownership of the schema.org standard

From the beginnings of schema.org, Google’s participation has been more active than anyone else’s, and Google’s guidance about schema.org has been more detailed than even the official schema.org website.  This has created a great deal of confusion among webmasters about what schema.org requires for compliance with the standard, as opposed to what Google requires for compliance for its search results and ranking.  It’s common for an SEO specialist to ask in a schema.org forum a question about Google’s search results.  Even people with a limited knowledge of schema.org’s mandate assume, correctly, that it exists primarily for the benefit of Google.

In theory, Google is just one of numerous organizations that implement a standard that is created by a third party.  In practice, Google is both the biggest user of the schema.org standard and its primary author.  Google is overwhelmingly the biggest consumer of schema.org structured data.  It also is by far the most active contributor to the standard.  Most other participants are along for the ride: trying to keep up with what Google is deciding internally about how it will use schema.org in its products, and what it is announcing externally about changes Google wants to make to the standard.

In many cases, if you want to understand the schema.org standard, you need to rely on Google’s documentation.  Webmasters routinely complain about the quality of schema.org’s documentation: its ambiguities, or the lack of examples.  Parts of the standard that are not priorities for Google are not well documented anywhere.  If they are priorities for Google, however, Google itself provides excellent documentation about how information should be specified in schema.org so that Google can use it.   Because schema.org’s documentation is poor, the focus of attention stays on Google.

The reliance that nearly everyone has on Google to ascertain compliance with schema.org requirements was highlighted last month by Google’s decision to discontinue its Structured Data Testing Tool, which is widely used by webmasters to check that their schema.org metadata is correct — at least as far as Google is concerned.  Because the concrete implementation requirements of schema.org are often murky, many rely on this Google tool to verify the correctness of the data independently of how the data would be used.  Google is replacing this developer-focused tool with a website that checks whether the metadata will display correctly in Google’s “rich results.”  The new “Rich Results Test Tool” acknowledges finally what’s been an open secret: Google’s promotion of schema.org is primarily about populating its walled garden with content.  

Google’s domination of the schema.org community

The purpose of a W3C group should be to serve everyone, not just a single company. In the case of schema.org, a W3C community has been dominated from the start by a single company: Google.

Google has chaired the schema.org community continuously since its inception in 2015.   Microsoft (Bing) and Yahoo (now Verizon), who are minor players in the search business, participate nominally but are not very active considering they were founding members of schema.org.  Google, in contrast, has multiple employees active in community discussions, steering the direction of conversations.  These employees shape the core decisions, together with a few independent consultants who have longstanding relationships with Google.  It’s hard to imagine any decision happening without Google’s consent.  Google has effective veto power over decisions.

Google’s domination of the schema.org community is possible because the community has no resources of its own.  Google conveniently volunteers the labor of its employees to perform duties related to community business, but these activities will naturally reflect the interests of the employer, Google.  Other firms lack the financial stake in the outcomes of schema.org decisions that Google has through its dominance of search and smartphones, so they don’t allocate their employees’ time to schema.org issues.  Google corners the discussion while appearing to be the most generous contributor.

The absence of governance in the schema.org community

The schema.org community essentially has zero governance — a situation Google is happy with.  There are no formal rules, no formal process for proposals and decisions, no way to appeal a decision, and no formal roles apart from the chair, who ultimately can decide everything. There’s no process of recusal.  Google holds sway in part because the community has no permanent and independent staff.  And there’s no independent board of oversight reviewing how business is conducted.

It’s tempting to see the absence of governance as an example of a group of developers who have a disdain for bureaucracy; that’s the view Google encourages.  But the commercial and social significance of these community decisions is enormous and shouldn’t be cloaked in capricious informality.  Moreover, the more mundane problems of a lack of process are also apparent.  Many people who attempt to make suggestions feel frozen out and unwelcome. Suggestions may be challenged by core insiders who have deep relationships with one another.  The standards-making process itself lacks standardization.

 In the absence of governance, the possibilities of a conflict of interest are substantial.  First, there’s the problem of self-dealing: Google using its position as the chair of a public forum to prioritize its own commercial interests ahead of others.  Second, there’s the possibility that non-Google proposals will be stopped because they are seen as costly to Google, if only because they create extra work for the largest single user of schema.org structured data.  

As a public company, Google is obligated to its shareholders — not to larger community interests.  A salaried Google employee can’t simultaneously promote his company’s commercial interests and promote interests that could weaken his company’s competitive position.  

Community bias in decisions

Few people want an open W3C community to exhibit biases in its decisions.  But owing to Google’s outsized participation and the absence of governance, decision making that’s biased toward Google’s priorities is common.

Whatever Google wants is fast-tracked, sometimes happening within a matter of days.  If a change to schema.org is needed to support a Google product that needs to ship, nothing will slow it down.

Suggestions from people not affiliated with Google face a tougher journey.  If a suggestion does not match Google’s priorities, it is slow-walked.  It will be challenged as to its necessity or practicality.  It will languish as an open issue on GitHub, where it will go unnoticed unless it generates an active discussion.  Eventually, the chair will cull proposals that have been long buried in the interest of closing out open issues.

While individuals and groups can propose suggestions of their own, successful ones tend to be incremental in nature, already aligned with Google’s agenda.  More disruptive or innovative ones are less likely to be adopted.

In the absence of a defined process, the ratification of proposals tends to happen through an informal virtual acclamation.  Various Google employees will conduct a public online discussion agreeing with one another on the merits of adopting a proposal or change.  With “community” sentiment demonstrated, the change is pushed ahead.  

Consumer harm from Google’s capture of schema.org

Google’s domination of schema.org is an essential part of its business model.  Schema.org structured data drives traffic to Google properties, and Google has leveraged it so that it can present fewer links that would drive traffic elsewhere.  The more time consumers spend on Google properties, the more their information decisions are limited to the ads that Google sells.  Consumers need to work harder to find “organic” links (objectively determined by their query and involving no payment to Google) to information sources they seek.

A technical standard should be a public good that benefits all.  In principle, publishers that use schema.org metadata should be able to expand the reach of their information, so that apps from many firms take advantage of it, and consumers have more choices about how and where they get their information.  The motivating idea behind semantic structured data, such as schema.org provides, is that information becomes independent of platforms.  But ironically, for consumers to enjoy the value of structured data, they mostly need to use Google products.  This is a significant market failure, which hasn’t happened by accident.

The original premise of the semantic web was based on openness.  Publishers freely offered information, and consumers could freely access it.  But the commercial version, driven by Google, has changed this dynamic.  The commercial semantic web isn’t truly open; it is asymmetrically open.  It involves open publishing but closed access.  Web publishers are free to publish their data using the schema.org standard and are actively encouraged to do so by Google. The barriers to creating structured data are minimal, though the barriers to retrieving it aren’t.  
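To make the asymmetry concrete, here is a minimal Python sketch of the easy half: reading schema.org metadata out of a single page that embeds it as JSON-LD. The sample page, the bakery it describes, and the helper names (JsonLdExtractor, extract_jsonld) are all hypothetical and invented for illustration. The sketch deliberately leaves out the hard half, which is crawling millions of sites and normalizing what they publish.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Hypothetical page: the business and its details are invented for illustration.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@type": "Bakery",
 "name": "Example Bakery",
 "openingHours": "Mo-Sa 07:00-15:00"}
</script>
</head><body><h1>Example Bakery</h1></body></html>
"""


def extract_jsonld(html):
    """Return every JSON-LD object embedded in one HTML document."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


records = extract_jsonld(SAMPLE_PAGE)
print(records[0]["@type"], records[0]["name"])
```

Reading one page takes a few dozen lines of standard-library code. Nothing in the sketch addresses discovering pages, crawling them at scale, or reconciling inconsistent data across publishers.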

Right now, only a firm with the scale of Google is in a position to access this data and normalize it into something useful for consumers.  Google’s formidable ad revenues allow it to crawl the web and harvest the data for its private gain.  A few other firms are also harvesting this data to build private knowledge graphs that similarly provide gated access.  The goal of open consumer access to this data remains elusive.  A small company may invest time or money to create structured data, but it lacks the means to use structured data for its own purposes.  But it doesn’t have to be this way.

Has Google’s domination of schema.org stifled innovation?

When considering how big tech has influenced innovation, it is necessary to pose a counterfactual question: What might have been possible if the heavy hand of a big tech platform hadn’t been meddling?

Google’s routine challenge to suggestions for additions to the schema.org vocabulary is to question whether the new idea will be used.  “What consuming application is going to use this?” is the common screening question.  If Google isn’t interested in using it, why is it worthwhile doing?  Unless the individual making the suggestion is associated with a huge organization that will build significant infrastructure around the new proposal, the proposal is considered unviable.  

The word choice of “consuming applications” is an example of how Google avoids referring to itself and its market dominance.  The Markup recently revealed how Google coaches its employees to avoid phrases that could get it in additional antitrust trouble.  Within the schema.org community group, Google employees strive to make discussion appear objective, where Google seems disinterested in the decision.  

One area where Google has discouraged alternative developments is the linking of schema.org data with data that uses other metadata vocabularies (standards).  This is significant for multiple reasons.  The schema.org vocabulary is limited in its scope, mostly focusing on commercial entities and not on non-commercial entities.  Because Google is not interested in non-commercial entity coverage, publishers need to rely on other vocabularies.  But Google doesn’t want to look at other vocabularies, claiming that it is too taxing for it to crawl data described by other vocabularies.  In this, Google is making a commercial decision that goes against the principles of linked data (a principle of the semantic web), which explicitly encourage the mixing of vocabularies.  Publishers are forced to obey Google’s diktats.  Why should they supply metadata that Google, the biggest consumer of schema.org metadata, says it will ignore?  With a few select exceptions, Google mandates that only schema.org metadata should be used in web content and no other semantic vocabularies.  Google sets the vision of what schema.org is, and what it does.
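As a rough illustration of what mixing vocabularies looks like in practice, the Python sketch below builds a JSON-LD record that describes a report with schema.org terms and adds coverage details from Dublin Core Terms. The record and its values are invented, and vocabularies_used is a hypothetical helper that only expands prefixes; it is not a full JSON-LD processor.

```python
import json

# Hypothetical record: the report and its values are invented for illustration.
document = {
    "@context": {
        "@vocab": "https://schema.org/",           # default vocabulary
        "dcterms": "http://purl.org/dc/terms/",    # second vocabulary, by prefix
    },
    "@type": "Report",
    "name": "Example Wetland Bird Survey",
    "dcterms:spatial": "Example County",
    "dcterms:temporal": "2019/2020",
}


def vocabularies_used(doc):
    """Map each property to the vocabulary IRI its name resolves to.

    Simplified: handles only a default @vocab and prefix:term names.
    """
    context = doc["@context"]
    result = {}
    for key in doc:
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @type
        prefix, sep, _ = key.partition(":")
        if sep and prefix in context:
            result[key] = context[prefix]
        else:
            result[key] = context["@vocab"]
    return result


print(json.dumps(document, indent=2))
print(vocabularies_used(document))
```

A consumer that reads only schema.org terms would see the name and type of this record and skip the two Dublin Core properties, which is the situation publishers face when the largest consumer ignores other vocabularies.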

To break this cycle, the public should be asking: How might consumers access and utilize information from the information commons without relying on Google?

There are several paths possible.  One might involve opening up the web crawl to wider use by firms of all sizes.  Another would be to expand the role of the schema.org vocabularies in APIs to support consumer apps.  Whatever path is pursued, it needs to be attractive to small firms and startups to bring greater diversity to consumers and spark innovation.

Possibilities for reform: Getting Google out of the way

If schema.org is to continue as a W3C community and remain associated with the trust conferred by that designation, then it will require serious reform.  It needs governance, and independence from Google.  It may need to transform into something far more formal than a community group.

In its current incarnation, it’s difficult to imagine this level of oversight.  The community is resource-starved, and relies on Google to function. But if schema.org isn’t viable without Google’s outsized involvement, then why does it exist at all?  Whose community is it?

There’s no rationale to justify the W3C  lending its endorsement to a community that is dominated by a single company.  One solution is for schema.org to cease being part of the W3C umbrella and return to its prior status of being a Google-sponsored initiative.  That would be the honest solution, barring more sweeping changes.

Another option would be to create a rival W3C standard that isn’t tied to Google and therefore couldn’t be dominated by it, but a standard Google couldn’t afford to ignore.  That would be a more radical option, involving significant reprioritization by publishers.  It would be disruptive in the short term, but might ultimately result in greater innovation.  A starting point for this option would be to explore how to popularize Wikidata as a general-purpose vocabulary that could be used instead of schema.org.

A final option would be for Google to step up in order to step down.  It could acknowledge that it has benefited enormously from the thousands of webmasters and others who contribute structured data and that it owes a debt to them.  It could offer to pay them back in kind.  Google could draw on the recent example of Facebook’s funding of an independent body that will provide oversight of that company.  Google could fund a truly independent body to oversee schema.org, and financially guarantee the creation of a new organizational structure.  Such an organization would leave no questions about how decisions are made and would dispel industry concerns that Google is gaining unfair advantages.  Given the heightening regulatory scrutiny of Google, this option is not as extravagant as it may first sound.

On a pragmatic level, I would like to see schema.org realize its full potential.  This issue is important enough to merit broader discussion, not just in the narrow community of people who work on web metadata, but also among those involved with regulating technology and looking at antitrust.  Google spends considerable sums, often furtively, hiring academic experts and others to dismiss concerns about its market dominance.  The role of metadata should be to make information more transparent.  That’s why this matters in many ways.

— Michael Andrews

Clarification: schema.org’s status as a  north star and as a standard (August 12)

The welcome page of schema.org notes it is “developed by an open community process, using the public-schemaorg@w3.org mailing list.”   When I first published this post I referred to schema.org as an “open W3C metadata standard.” Dan Brickley of Google tweeted to me and others stating that I made a “simple factual error” doing so. He is technically correct that my characterization of the W3C’s role is not precise, so I have changed the wording to say “a W3C-sanctioned metadata standard” instead (sanctioned = permitted), which is the most accurate I can manage, given the intentionally confusing nature of schema.org’s mandate.  This may seem like mincing words, but the implications are important, and I want to elaborate on what those are.

It is true that schema.org is not an official W3C standard in the sense that HTML5 is, which had a cast of thousands involved in its development.  For a specification to become an official W3C standard, it needs to go through a long process of community vetting, moving through stages such as working draft and candidate recommendation.  Even a candidate recommendation is not yet an official standard, though it may be widely followed.  Just because technical guidelines aren’t official W3C standards, or aren’t even referred to as standards, does not mean they don’t have the effect of a standard that others would be expected to follow in order to gain market acceptance. Standards vary in the degree to which they are voluntary, and schema.org has always been a voluntary standard.  And there are different levels of standards maturity within the W3C’s standards-making framework, with the most mature ones reflecting the most stringent levels of compliance.  A W3C community group’s discussions around standards proposals would be the least rigorous and are normally associated with the least developed stage of standards activity.  Such discussions are typically associated with new ideas for standards, rather than well-formed standards that are widely used by thousands of companies.

A key difference with the schema.org community group is that it hosts discussions about a fully formed standard.  This standard was fully formed before there was ever a community group to discuss it.  In other words, there was never any community input on the basic foundation of schema.org.  Google decided this together with its partners in the schema.org initiative.

So I agree that schema.org fails to satisfy the expectations of a W3C standard.  The W3C has a well-established process for standards, and schema.org’s governance doesn’t remotely align with how a W3C standard is developed.  

The problem is that by having a fully formed standard discussed in a W3C forum, it appears as if schema.org is a W3C standard of some sort.  Appearances do matter. Webmasters on the W3C mailing list can reasonably assume the W3C endorses schema.org.  And by hosting a community group on schema.org, the W3C has lent support to schema.org.  To outsiders, the W3C appears to be sponsoring its development and, one would presume, to be interested in having open participation in decision making about it.  The terms of service for schema.org treat “the schemas published by Schema.org as if the schemas were W3C Recommendations.”  The optics of schema.org imply it is W3C-ish.

Dan Brickley refers to schema.org as an “independent project” and not a “W3C thing.”  I’m not reassured by that characterization, which is the first I’ve heard Google draw explicit distance from W3C affiliation.  He seems to be rejecting the notion that the W3C should provide any oversight over the schema.org process.  By this account, the W3C is merely providing a free mailing list.  The four corporate “sponsors” of schema.org set the binding conditions of the terms of service.  Nothing schema.org is working on is intended to become an official W3C standard and hence subject to W3C governance.

Even though 10 million websites use schema.org metadata and are affected by its decisions, schema.org’s decision making is tightly held.   Ultimate decision making authority rests with a Steering Committee (also chaired by Google) that is invitation-only and not open to public participation.  Supposedly, a W3C representative is allowed to sit on this committee, though the details about this, like much else in schema.org’s documentation, are unclear.   

It may seem reassuring to imagine that schema.org belongs to a nebulous entity called the “community,” but that glosses over how much of the community activities and decisions are Google-driven. Google does draw on the expertise and ideas of others, so that schema.org is more than one company’s creation.  But in the end, Google keeps tight control over the process so that schema.org reflects its priorities.  It would be simpler to call this the Google Structured Data schema.  

Schema.org appears to be public and open, while in practice it is controlled by a small group of competitors and one firm in particular. Google is having its cake and eating it too.  If schema.org does not want W3C oversight, then the W3C should disavow having a connection with it, and help to reduce at least some of the confusion about who is in control of schema.org.


Lumping and Splitting in Taxonomy

Creating a taxonomy, which means noting which distinctions matter, often seems more art than science.  I’ve been interested in how to think about taxonomy more globally, instead of looking at it as a case-by-case judgment call.  Part of my interest here is a spin-off from my interest in birding.  I’m no ornithologist, but I try to learn what I can about the nature of birds.  And species of birds, of course, are classified according to a taxonomy.

The taxonomy for birds is among the most rigorous out there.  It is debated and litigated, sometimes over decades.  The process involves a progression of “lumps” and “splits” that recalibrate which distinctions are considered significant.  Recently the taxonomy underwent a major revision that reordered the kingdom of birds. 

In the mid-2010s, scientists changed the classification of birds to consider not only anatomical features, but DNA.  In the new ordering, eagles and falcons are not as closely related as was previously assumed. Eagles are closer to vultures, while falcons are closer to parrots.  And pigeons and flamingos are more closely related than thought previously.  Appearance alone is not enough on which to base similarity.

More closely related than you might think (Both produce milk to feed their young)

Taxonomy and Information Technology

Taxonomy doesn’t receive the attention it deserves in the IT world.  It seems subjective: vague, hard to predict, potentially the source of arguments.  Taxonomy resembles content: it may be necessary, but it is something to work around — “place taxonomy here when ready.”

But taxonomy can’t be avoided. Even though semantic technologies are becoming richer in describing the characteristics of entities, the properties of entities alone may not be enough to distinguish between types of entities.  Many entities share common properties, and even common values, so it becomes important to be able to indicate what type of entity something is.  We can describe something in terms of its physical properties such as weight, height, color and so on, and still have no idea what it is we are describing.  It can resemble the parlor game of twenty questions: a prolonged discourse that’s prone to howlers.
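A small Python sketch can illustrate why properties alone fall short. The Entity class, its fields, and the values below are hypothetical and chosen only so that two very different things share identical physical properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_type: str   # the taxonomy category the entity belongs to
    weight_kg: float
    height_cm: float
    color: str

    def properties(self):
        """Physical properties only, ignoring the entity's type."""
        return (self.weight_kg, self.height_cm, self.color)


# Invented values: a stool and a small dog with the same measurements.
stool = Entity("Furniture", 4.0, 45.0, "white")
dog = Entity("Animal", 4.0, 45.0, "white")

# The property values match exactly...
same_properties = stool.properties() == dog.properties()
# ...yet the entities differ once their type is taken into account.
same_entity = stool == dog

print(same_properties, same_entity)
```

Without the entity_type field, a system comparing these two records would have no grounds to tell them apart.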

Classification is the bedrock of algorithms: classifications drive automated decisions.  Yet taxonomies are human designed.  Taxonomies lack the superficial impartiality of machine-oriented linked data or machine learning classification.  But taxonomies are useful because of their perceived limitations. They require human attention and human judgment.  That helps make data more explainable.

Humans decide taxonomies, even when machines provide assistance finding patterns of similarity. Users of taxonomies need to understand the basis of similarity.  No matter how experienced the taxonomist or sophisticated the text analysis, the basis of a taxonomy should ideally be explainable and repeatable.  Machine-driven clustering approaches lack these qualities.

To be durable, a taxonomy needs a reasoned basis and justification.  Business taxonomies can borrow ideas from scientific taxonomies.   

Four approaches can help us decide how to classify categories:

  1. Homology
  2. Analogy
  3. Differentia
  4. Interoperability

Homology and analogy deal with “lumping” — finding commonality among different items.  Differentia and interoperability help define “splitting” — where to break out similar things.

Homology: Discovering shared origins

Homology is a term taxonomists use to describe features that, while appearing different, have a common origin and original intent.  For example, mammals have limbs, but the limb could be manifested as an arm or as a flipper.

Homology refers to cases where things start the same but go in different directions.  It can get at the core essence of a feature: what it enables, without worrying so much how it appears or precisely what it does.  Homology is helpful to find larger categories that link together different things.

There are two ways we can use homology when creating a taxonomy. 

First, we can look at the components or features of items.  We look for what they share in common that might suggest a broader capability to pay attention to.  Lots of devices have embedded microprocessors, even though these devices play different roles in our lives.  Microprocessors provide a common set of capabilities that even allow different kinds of items to interact with one another, such as in the case of the Internet of Things (IoT).  Homology is not limited to physical items.  Many business models get copied and modified by different industries, but they share common origins and drivers. We can speak of a class of businesses using an online subscription model, for example.

Second, we can consider whole items and how they are used.  Homology can be useful when a distinct thing has more than one use, especially when it doesn’t have a single primary purpose.  Baking soda is advertised as having many purposes and some consumers like products that contain baking soda.  Here we have a category of baking soda-derived products.  In the kitchen, there are many small appliances that have a rotator on which one can attach implements.  They may be called a food processor, a blender, a mixer, or some trademarked proprietary name.  What can they do?  Many tasks: chopping vegetables, making dough, making soups, smoothies, spreads…the list is endless.  But most seem to be about pulverizing and mixing ingredients.  It’s a broad class of gadgets that share many capabilities, though they scatter in what they offer as they seek to differentiate themselves.

But there’s another approach to lumping things: analogy.    

Analogy: Discovering shared functions

We use analogies all the time in our daily conversation.  Taxonomists focus on what analogies reveal.  

Analogy helps identify things that are functionally similar, and might share a category as a result.

Analogy is the opposite of homology. With analogy, two things start from a different place, but produce a similar result.  For example, the wings of bees and wings of birds are analogous.  They are similar in their function, but different in their origin and details.  Analogies capture common affordances: where different things can be used in similar ways.

Analogies are most useful when defining mental categories, such as devices to watch video, or places to go on a first date.  It’s the most subjective kind of taxonomy: different people need to hold similar views in order for these categories to be credible.

Contrasting homology and analogy, we can see that they represent two opposite notions.  Homology represents divergence (from similarity to differences), while analogy represents convergence (from differences to similarity).
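The two kinds of lumping can be sketched in a few lines of Python.  This is only an illustration, not a method from the taxonomy literature: the items come from the examples above, while the origin and function labels, the ITEMS list and the lump helper are hypothetical names chosen for the sketch.

```python
from collections import defaultdict

# Hypothetical records: (item, origin, function).
# The labels are illustrative assumptions, not an established classification.
ITEMS = [
    ("human arm", "mammal limb", "grasping"),
    ("whale flipper", "mammal limb", "swimming"),
    ("bird wing", "bird forelimb", "flying"),
    ("bee wing", "insect wing", "flying"),
]


def lump(items, key_index):
    """Group item names by the attribute at key_index (1 = origin, 2 = function)."""
    groups = defaultdict(list)
    for record in items:
        groups[record[key_index]].append(record[0])
    return dict(groups)


# Homology: lump by shared origin, whatever the items do today.
by_origin = lump(ITEMS, 1)

# Analogy: lump by shared function, whatever the items came from.
by_function = lump(ITEMS, 2)

print(by_origin["mammal limb"])
print(by_function["flying"])
```

The same items land in different groups depending on which attribute is used as the basis of similarity, which is why the basis needs to be stated explicitly.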

The other end of taxonomy is not about lumping things into broader categories, but splitting them into smaller ones.

Differentia: Defining Segments

Taxonomists talk about differentia (Latin for difference), which is broadly similar to what marketers refer to as segmentation.

Aristotle defined humans as animals capable of articulated speech. His formulation provided a structural pattern still used in taxonomy today:

  • A species equals a genus plus differentia

That is, the differences within a genus define individual species.  

To put it in more general terms: 

  • A segment is a group plus its distinguishing characteristics (its epithet)

A group gets divided into segments based on distinguishing characteristics.  The differentia separates members from other members.  

One of the most popular marketing segmentations relates to generational differences. In the United States, people born after the Second World War are segmented into four groups by birth year.  Other countries use similar segments, but it is not a universal segmentation so I will focus specifically on US nationals.  A common segmentation (with the exact years sometimes varying slightly) is:

  • Generation W (aka “Boomers”): American nationals born between 1946 and 1964
  • Generation X: American nationals born between 1965 and 1980
  • Generation Y (aka “Millennials”): American nationals born between 1981 and 1996
  • Generation Z: American nationals born since 1997

Such segmentation has the virtue of creating category segments that are comprehensive (no item is without a category) and mutually exclusive (no item belongs to more than one category).  It’s clean, though it is not necessarily correct, in the sense of the categories identifying what matters most.
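The two properties can be checked mechanically.  The following is a minimal Python sketch, assuming the birth-year ranges listed above; the names GENERATIONS, segments_for and is_clean are hypothetical, and the upper bound of 2020 used in the check is an arbitrary assumption.

```python
# Birth-year ranges as listed above (inclusive). None marks an open-ended range.
GENERATIONS = [
    ("Generation W (Boomers)", 1946, 1964),
    ("Generation X", 1965, 1980),
    ("Generation Y (Millennials)", 1981, 1996),
    ("Generation Z", 1997, None),  # "born since 1997"
]


def segments_for(birth_year):
    """Return the names of every segment whose range contains birth_year."""
    return [
        name
        for name, start, end in GENERATIONS
        if birth_year >= start and (end is None or birth_year <= end)
    ]


def is_clean(first_year, last_year):
    """True if every year in the span maps to exactly one segment.

    Zero matches would break comprehensiveness; two or more would break
    mutual exclusivity.
    """
    return all(
        len(segments_for(year)) == 1
        for year in range(first_year, last_year + 1)
    )


print(segments_for(1980))
print(is_clean(1946, 2020))
```

A check like this can confirm that a segmentation is clean.  It cannot say whether the distinction it rests on is an important one.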

Segments won’t be valuable if the distinctions on which they are based aren’t that important.  A segment could comprise things with a common characteristic that are otherwise quite diverse.  It’s possible for a segment to be designed around an incidental characteristic that makes different things seem similar.

The point of differentia is to represent a defining characteristic. Differentia is valuable when it helps us think through which distinctions matter and are valid.  For example, we might segment people by eye color.  But that hardly seems an important way to segment people. Such segmentation encourages us to refine the group we are segmenting.  Eye color is of interest to makers of tinted contact lenses.  But even then, eye color is not a defining characteristic of a potential contact lens customer, even if it were a relevant one.

While differentia can be hard to define durably, it can play a useful role in taxonomies.  It seems reasonable to segment aircraft according to the number of passengers they carry, for example.  It can capture one key aspect that represents many important issues.

Interoperability: Distinctions within commonality

A related issue is deciding when things are similar enough to say they are the same, and when we can say they are related but different.

Our final perspective comes from nature. The similarity of species is partly defined by their ability to mate.  Some closely related species of birds, for example, will crossbreed.  Other pairs of less similar species lack that ability.

A similar situation exists with languages.  Where are the distinctions and boundaries between similar languages? And when are differences just dialects and not actually different languages?  In language, mutual intelligibility plays a role.  (Language also involves convergence and divergence, but we’ll consider only interoperability here.)

The presence or absence of connection between distinct things is associated with two overlapping but distinct concepts: 

  1. Interoperability 
  2. Substitution

Both these concepts address ways in which distinct things might be considered the “same.”

Interoperability is most often associated with technology, though it can also be applied to other areas, such as cultural norms like religions.  The presence of interoperability (the ability of distinct things to connect together easily because they follow a common standard or code of operation) is an indication of their similarity.  If things interoperate, meaning they require no change in set up to work together, then they belong to the same “family,” even if the things come from different sources. The absence of interoperability is a sign that these things may not belong together and need to be split.

Being part of the same family does not imply they are the same.   Any distinctions would relate to the role of each thing in the family (same family, different roles).   Things that follow the same standard may be similar (same role), or they may be complements (different roles).  

If things can be substituted (they are interchangeable but require a different set up to use), they may belong to the same category, but that category may need to be broken down further.  Windows, Linux and macOS computers can be substituted for one another because they serve the same role, so they belong to the broader personal computer category (same role, different families).  But they are separate categories because they don’t interoperate.
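The family and role distinctions above can be laid out as a small decision rule.  This Python sketch is one possible reading of them, not a standard algorithm: the Thing type, the relate function and the "unrelated" label are assumptions, and the printer driver is a hypothetical example added only to show a complement.

```python
from typing import NamedTuple


class Thing(NamedTuple):
    name: str
    family: str  # the standard or code of operation the thing follows
    role: str    # what the thing is used for


def relate(a: Thing, b: Thing) -> str:
    """Classify a pair using the family/role distinctions described above."""
    if a.family == b.family:
        # Same family: the things interoperate without a change in set up.
        return "similar" if a.role == b.role else "complement"
    if a.role == b.role:
        # Same role, different families: interchangeable, but not interoperable.
        return "substitute"
    # Label assumed for this sketch; the text does not name this case.
    return "unrelated"


windows_pc = Thing("Windows PC", "Windows", "personal computer")
linux_pc = Thing("Linux PC", "Linux", "personal computer")
# Hypothetical example of a different role within the same family.
windows_driver = Thing("Windows printer driver", "Windows", "peripheral software")

print(relate(windows_pc, linux_pc))
print(relate(windows_pc, windows_driver))
```

Under this reading, a "substitute" result suggests a shared broad category that needs splitting further, while a "complement" result suggests one family with distinct roles.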

The value of taxonomies

Defining taxonomies is not easy.  Interpretation is needed to spot the differences that make a difference. We can improve the discovery process by using heuristic perspectives for lumping and splitting. 

Taxonomy is valuable because it can provide a succinct way to express the significance of an entity in relation to other entities.  Sometimes we need a quick summary to boil down the essence of a thing: what’s distinctive about it, so we can see how it relates to a given situation.  Taxonomies help us overcome the fragmentation of information.

— Michael Andrews