Lumping and Splitting in Taxonomy

Creating a taxonomy, which means noting which distinctions matter, often seems more art than science. I’ve been interested in how to think about taxonomy more globally, instead of looking at it as a case-by-case judgment call. Part of my interest here is a spin-off from my interest in birding. I’m no ornithologist, but I try to learn what I can about the nature of birds. And species of birds, of course, are classified according to a taxonomy.

The taxonomy for birds is among the most rigorous out there. It is debated and litigated, sometimes over decades. The process involves a progression of “lumps” and “splits” that recalibrate which distinctions are considered significant. Recently the taxonomy underwent a major revision that reordered the kingdom of birds.

In the mid-2010s, scientists changed the classification of birds to consider not only anatomical features, but also DNA. In the new ordering, eagles and falcons are not as closely related as was previously assumed. Eagles are closer to vultures, while falcons are closer to parrots. And pigeons and flamingos are more closely related than previously thought. Appearance alone is not a sufficient basis for judging similarity.

[Image] More closely related than you might think (both produce milk to feed their young)

Taxonomy and Information Technology

Taxonomy doesn’t receive the attention it deserves in the IT world. It seems subjective: vague, hard to predict, potentially the source of arguments. Taxonomy resembles content: it may be necessary, but it is something to work around — “place taxonomy here when ready.”

But taxonomy can’t be avoided. Even though semantic technologies are becoming richer in describing the characteristics of entities, the properties of entities alone may not be enough to distinguish between types of entities. Many entities share common properties, and even common values, so it becomes important to be able to indicate what type of entity something is. We can describe something in terms of its physical properties such as weight, height, color and so on, and still have no idea what it is we are describing. It can resemble the parlor game of twenty questions: a prolonged discourse that’s prone to howlers.

Classification is the bedrock of algorithms: it drives automated decisions. Yet taxonomies are human designed. Taxonomies lack the superficial impartiality of machine-oriented linked data or machine learning classification. But taxonomies are useful because of their perceived limitations. They require human attention and human judgment. That helps make data more explainable.

Humans decide taxonomies, even when machines provide assistance finding patterns of similarity. Users of taxonomies need to understand the basis of similarity. No matter how experienced the taxonomist or how sophisticated the text analysis, the basis of a taxonomy should ideally be explainable and repeatable. Machine-driven clustering approaches lack these qualities.

To be durable, a taxonomy needs a reasoned basis and justification. Business taxonomies can borrow ideas from scientific taxonomies.

Four approaches can help us decide how to classify categories:

Homology
Analogy
Differentia
Interoperability

Homology and analogy deal with “lumping” — finding commonality among different items. Differentia and interoperability help define “splitting” — where to break out similar things.

Homology: Discovering shared origins

Homology is a term taxonomists use to describe features that, while appearing different, have a common origin and original intent. For example, mammals have limbs, but the limb could be manifested as an arm or as a flipper.

Homology refers to cases where things start the same but go in different directions. It can get at the core essence of a feature: what it enables, without worrying so much how it appears or precisely what it does. Homology is helpful to find larger categories that link together different things.

There are two ways we can use homology when creating a taxonomy.

First, we can look at the components or features of items. We look for what they share in common that might suggest a broader capability to pay attention to. Lots of devices have embedded microprocessors, even though these devices play different roles in our lives. Microprocessors provide a common set of capabilities that even allow different kinds of items to interact with one another, such as in the case of the Internet of Things (IoT). Homology is not limited to physical items. Many business models get copied and modified by different industries, but they share common origins and drivers. We can speak of a class of businesses using an online subscription model, for example.

Second, we can consider whole items and how they are used. Homology can be useful when a distinct thing has more than one use, especially when it doesn’t have a single primary purpose. Baking soda is advertised as having many purposes and some consumers like products that contain baking soda. Here we have a category of baking soda-derived products. In the kitchen, there are many small appliances that have a rotator on which one can attach implements. They may be called a food processor, a blender, a mixer, or some trademarked proprietary name. What can they do? Many tasks: chopping vegetables, making dough, making soups, smoothies, spreads…the list is endless. But most seem to be about pulverizing and mixing ingredients. It’s a broad class of gadgets that share many capabilities, though they scatter in what they offer as they seek to differentiate themselves.

But there’s another approach to lumping things: analogy.

Analogy: Discovering shared functions

We use analogies all the time in our daily conversation. Taxonomists focus on what analogies reveal.

Analogy helps identify things that are functionally similar, and might share a category as a result.

Analogy is the opposite of homology. With analogy, two things start from a different place, but produce a similar result. For example, the wings of bees and wings of birds are analogous. They are similar in their function, but different in their origin and details. Analogies capture common affordances: where different things can be used in similar ways.

Analogies are most useful when defining mental categories, such as devices to watch video, or places to go on a first date. It’s the most subjective kind of taxonomy: different people need to hold similar views in order for these categories to be credible.

Contrasting homology and analogy, we can see that they represent two opposite notions: divergence (from similarity to differences) in the case of homology, and convergence (from differences to similarity) in the case of analogy.

The other end of taxonomy is not about lumping things into broader categories, but splitting them into smaller ones.

Differentia: Defining segments

Taxonomists talk about differentia (Latin for difference), which is broadly similar to what marketers refer to as segmentation.

Aristotle defined humans as animals capable of articulated speech. His formulation provided a structural pattern still used in taxonomy today:

A species equals a genus plus differentia

That is, the differences within a genus define individual species.

To put it in more general terms:

A segment is a group plus its distinguishing characteristics (its epithet)

A group gets divided into segments based on distinguishing characteristics. The differentia separates members from other members.
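The formula can be sketched in code. What follows is only a minimal illustration, assuming each item is described by a plain dictionary; the Segment class, its field names, and the sample items are hypothetical and not part of any established taxonomy tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Segment:
    """A segment is a group (genus) plus a distinguishing test (differentia)."""
    name: str
    genus: str
    differentia: Callable[[Dict], bool]

    def contains(self, item: Dict) -> bool:
        # An item belongs only if it is in the genus AND passes the differentia.
        return item.get("genus") == self.genus and self.differentia(item)


# Aristotle's example: a human is an animal capable of articulated speech.
human = Segment(
    name="human",
    genus="animal",
    differentia=lambda item: item.get("articulated_speech", False),
)

# Hypothetical sample items, described only by the fields the sketch needs.
socrates = {"name": "Socrates", "genus": "animal", "articulated_speech": True}
sparrow = {"name": "sparrow", "genus": "animal", "articulated_speech": False}

print(human.contains(socrates))
print(human.contains(sparrow))
```

Keeping the genus and the differentia as separate fields makes the basis of each split explicit, which is the point of the formula.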

One of the most popular marketing segmentations relates to generational differences. In the United States, people born after the Second World War are segmented into four groups by birth year. Other countries use similar segments, but it is not a universal segmentation so I will focus specifically on US nationals. A common segmentation (with the exact years sometimes varying slightly) is:

Generation W (aka “Boomers”): American nationals born between 1946 and 1964
Generation X: American nationals born between 1965 and 1980
Generation Y (aka “Millennials”): American nationals born between 1981 and 1996
Generation Z: American nationals born since 1997

Such segmentation has the virtue of creating category segments that are comprehensive (no item is without a category) and mutually exclusive (no item belongs to more than one category). It’s clean, though it is not necessarily correct, if by correct we mean that the categories identify what matters most.
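Both properties can be checked mechanically. Here is a small sketch that uses the birth-year ranges from the list above; the function names are my own, and the check only covers the span of years it is asked about.

```python
from typing import List, Optional, Tuple

# Birth-year ranges from the list above; None marks an open-ended range.
GENERATIONS: List[Tuple[str, int, Optional[int]]] = [
    ("Generation W (Boomers)", 1946, 1964),
    ("Generation X", 1965, 1980),
    ("Generation Y (Millennials)", 1981, 1996),
    ("Generation Z", 1997, None),
]


def segments_for(birth_year: int) -> List[str]:
    """Return every segment whose range contains the birth year."""
    return [
        name
        for name, start, end in GENERATIONS
        if birth_year >= start and (end is None or birth_year <= end)
    ]


def is_clean_partition(first_year: int, last_year: int) -> bool:
    """True if every year in the span falls into exactly one segment.

    Zero matches would break comprehensiveness; two or more would
    break mutual exclusivity.
    """
    return all(
        len(segments_for(year)) == 1
        for year in range(first_year, last_year + 1)
    )


print(segments_for(1970))
print(is_clean_partition(1946, 2020))
```

A check like this confirms only that the segmentation is clean. It says nothing about whether birth year is the distinction that matters.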

Segments won’t be valuable if the distinctions on which they are based aren’t that important. A segment could comprise things with a common characteristic that are otherwise quite diverse. It’s possible for a segment to be designed around an incidental characteristic that makes different things seem similar.

The point of differentia is to represent a defining characteristic. Differentia is valuable when it helps us think through which distinctions matter and are valid. For example, we might segment people by eye color. But that hardly seems an important way to segment people. Such segmentation encourages us to refine the group we are segmenting. Eye color is of interest to makers of tinted contact lenses. But even then, eye color is not a defining characteristic of a potential contact lens customer, even if it were a relevant one.

While differentia can be hard to define durably, it can play a useful role in taxonomies. It seems reasonable to segment aircraft according to the number of passengers they carry, for example. It can capture one key aspect that represents many important issues.

Interoperability: Distinctions within commonality

A related issue is deciding when things are similar enough to say they are the same, and when we can say they are related but different.

Our final perspective comes from nature. The similarity of species is partly defined by their ability to mate. Some closely related species of birds, for example, will cross breed. Other pairs of less similar species lack that ability.

A similar situation exists with languages. Where are the distinctions and boundaries between similar languages? And when are differences just dialects and not actually different languages? In language, mutual intelligibility plays a role. (Language also involves convergence and divergence, but we’ll consider only interoperability here.)

The presence or absence of connection between distinct things is associated with two overlapping but distinct concepts:

Interoperability
Substitution

Both these concepts address ways in which distinct things might be considered the “same.”

Interoperability is most often associated with technology, though it can be applied to other areas as well, for example to cultural norms such as religions. The presence of interoperability (the ability of distinct things to connect together easily because they follow a common standard or code of operation) is an indication of their similarity. If things interoperate, meaning they require no change in setup to work together, then they belong to the same “family,” even if the things come from different sources. The absence of interoperability is a sign that these things may not belong together and need to be split.

Being part of the same family does not imply they are the same. Any distinctions would relate to the role of each thing in the family (same family, different roles). Things that follow the same standard may be similar (same role), or they may be complements (different roles).

If things can be substituted, meaning they are interchangeable but require a different setup to use, they may belong to the same category, but that category may need to be broken down further. Windows, Linux and macOS computers can be substituted for one another (they serve the same role), so they belong to the broader personal computer category (same role, different families). But they are separate categories because they don’t interoperate.
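The family and role distinctions can be combined into a simple decision rule. The sketch below is one possible reading of them, not a definitive method: the Thing class, the labels, and the USB examples are illustrative assumptions, and the “unrelated” outcome for things sharing neither family nor role is my own addition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thing:
    name: str
    family: str  # the standard or platform it follows (interoperability)
    role: str    # the job it performs (substitution)


def relate(a: Thing, b: Thing) -> str:
    """Classify how two things relate, based on shared family and role."""
    same_family = a.family == b.family
    same_role = a.role == b.role
    if same_family and same_role:
        return "similar"       # same family, same role
    if same_family:
        return "complements"   # same family, different roles
    if same_role:
        return "substitutes"   # same role, different families: split further
    return "unrelated"         # assumption: no basis for lumping at all


windows_pc = Thing("Windows computer", family="Windows", role="personal computer")
linux_pc = Thing("Linux computer", family="Linux", role="personal computer")
usb_keyboard = Thing("USB keyboard", family="USB", role="input device")
usb_drive = Thing("USB drive", family="USB", role="storage")

print(relate(windows_pc, linux_pc))
print(relate(usb_keyboard, usb_drive))
```

In practice, deciding what counts as a family or a role is itself a judgment call; the rule only makes the consequences of that judgment explicit.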

The value of taxonomies

Defining taxonomies is not easy. Interpretation is needed to spot the differences that make a difference. We can improve the discovery process by using heuristic perspectives for lumping and splitting.

Taxonomy is valuable because it can provide a succinct way to express the significance of an entity in relation to other entities. Sometimes we need a quick summary to boil down the essence of a thing: what’s distinctive about it, so we can see how it relates to a given situation. Taxonomies help us overcome the fragmentation of information.

— Michael Andrews