Categories
Content Engineering

Lumping and Splitting in Taxonomy

Creating a taxonomy (noting which distinctions matter) often seems more art than science.  I’ve been interested in how to think about taxonomy more globally, instead of looking at it as a case-by-case judgment call.  Part of my interest here is a spin-off from my interest in birding.  I’m no ornithologist, but I try to learn what I can about the nature of birds.  And species of birds, of course, are classified according to a taxonomy.

The taxonomy for birds is among the most rigorous out there.  It is debated and litigated, sometimes over decades.  The process involves a progression of “lumps” and “splits” that recalibrate which distinctions are considered significant.  Recently the taxonomy underwent a major revision that reordered the kingdom of birds. 

In the mid-2010s, scientists changed the classification of birds to consider not only anatomical features, but DNA.  In the new ordering, eagles and falcons are not as closely related as was previously assumed. Eagles are closer to vultures, while falcons are closer to parrots.  And pigeons and flamingos are more closely related than thought previously.  Appearance alone is not enough on which to base similarity.

More closely related than you might think (Both produce milk to feed their young)

Taxonomy and Information Technology

Taxonomy doesn’t receive the attention it deserves in the IT world.  It seems subjective: vague, hard to predict, potentially the source of arguments.  Taxonomy resembles content: it may be necessary, but it is something to work around — “place taxonomy here when ready.”

But taxonomy can’t be avoided. Even though semantic technologies are becoming richer in describing the characteristics of entities, the properties of entities alone may not be enough to distinguish between types of entities.  Many entities share common properties, and even common values, so it becomes important to be able to indicate what type of entity something is.  We can describe something in terms of its physical properties such as weight, height, color and so on, and still have no idea what it is we are describing.  It can resemble the parlor game of twenty questions: a prolonged discourse that’s prone to howlers.

Classifications are the bedrock of algorithms: they drive automated decisions.  Yet taxonomies are human designed.  Taxonomies lack the superficial impartiality of machine-oriented linked data or machine learning classification.  But taxonomies are useful because of their perceived limitations. They require human attention and human judgment.  That helps make data more explainable.

Humans decide taxonomies, even when machines provide assistance finding patterns of similarity. Users of taxonomies need to understand the basis of similarity.  No matter how experienced the taxonomist or sophisticated the text analysis, the basis of a taxonomy should ideally be explainable and repeatable.  Machine-driven clustering approaches often lack these qualities.

To be durable, a taxonomy needs a reasoned basis and justification.  Business taxonomies can borrow ideas from scientific taxonomies.   

Four approaches can help us decide how to classify categories:

  1. Homology
  2. Analogy
  3. Differentia
  4. Interoperability

Homology and analogy deal with “lumping” — finding commonality among different items.  Differentia and interoperability help define “splitting” — where to break out similar things.

Homology: Discovering shared origins

Homology is a term taxonomists use to describe when features, while appearing different, have a common origin and original intent.  For example, mammals have limbs, but the limb could be manifested as an arm or as a flipper.

Homology refers to cases where things start the same but go in different directions.  It can get at the core essence of a feature: what it enables, without worrying so much how it appears or precisely what it does.  Homology is helpful to find larger categories that link together different things.

There are two ways we can use homology when creating a taxonomy. 

First, we can look at the components or features of items.  We look for what they share in common that might suggest a broader capability to pay attention to.  Lots of devices have embedded microprocessors, even though these devices play different roles in our lives.  Microprocessors provide a common set of capabilities that even allow different kinds of items to interact with one another, such as in the case of the Internet of Things (IoT).  Homology is not limited to physical items.  Many business models get copied and modified by different industries, but they share common origins and drivers. We can speak of a class of businesses using an online subscription model, for example.

Second, we can consider whole items and how they are used.  Homology can be useful when a distinct thing has more than one use, especially when it doesn’t have a single primary purpose.  Baking soda is advertised as having many purposes and some consumers like products that contain baking soda.  Here we have a category of baking soda-derived products.  In the kitchen, there are many small appliances that have a rotator on which one can attach implements.  They may be called a food processor, a blender, a mixer, or some trademarked proprietary name.  What can they do?  Many tasks: chopping vegetables, making dough, making soups, smoothies, spreads…the list is endless.  But most seem to be about pulverizing and mixing ingredients.  It’s a broad class of gadgets that share many capabilities, though they scatter in what they offer as they seek to differentiate themselves.

But there’s another approach to lumping things: analogy.    

Analogy: Discovering shared functions

We use analogies all the time in our daily conversation.  Taxonomists focus on what analogies reveal.  

Analogy helps identify things that are functionally similar, and might share a category as a result.

Analogy is the opposite of homology. With analogy, two things start from a different place, but produce a similar result.  For example, the wings of bees and wings of birds are analogous.  They are similar in their function, but different in their origin and details.  Analogies capture common affordances: where different things can be used in similar ways.

Analogies are most useful when defining mental categories, such as devices to watch video, or places to go on a first date.  It’s the most subjective kind of taxonomy: different people need to hold similar views in order for these categories to be credible.

Contrasting homology and analogy, we can see two concepts: divergence (from similarity to differences), which homology represents, and convergence (from differences to similarity), which analogy represents.
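The two kinds of lumping can be sketched as grouping the same items by different attributes. This is only a minimal illustration: the `origin` and `function` labels below are simplified assumptions chosen to echo the limb and wing examples, and `lump` is a hypothetical helper, not an established method.

```python
from collections import defaultdict

# Illustrative features; the origin and function labels are simplified assumptions.
FEATURES = [
    {"name": "human arm", "origin": "vertebrate forelimb", "function": "grasping"},
    {"name": "whale flipper", "origin": "vertebrate forelimb", "function": "swimming"},
    {"name": "bird wing", "origin": "vertebrate forelimb", "function": "flying"},
    {"name": "bee wing", "origin": "insect wing", "function": "flying"},
]

def lump(items, key):
    """Group item names by the value each item has for the given key."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item["name"])
    return dict(groups)

by_homology = lump(FEATURES, "origin")    # shared origin, divergent function
by_analogy = lump(FEATURES, "function")   # shared function, different origin
```

In this sketch the bird wing lands with arms and flippers when lumped by origin, and with the bee wing when lumped by function, which is the difference between the two perspectives.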

The other end of taxonomy is not about lumping things into broader categories, but splitting them into smaller ones.

Differentia: Defining Segments

Taxonomists talk about differentia (Latin for difference), which is broadly similar to what marketers refer to as segmentation.

Aristotle defined humans as animals capable of articulated speech. His formulation provided a structural pattern still used in taxonomy today:

  • A species equals a genus plus differentia

That is, the differences within a genus define individual species.  

To put it in more general terms: 

  • A segment is a group plus its distinguishing characteristics (its epithet)

A group gets divided into segments based on distinguishing characteristics.  The differentia separates members from other members.  
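The "group plus distinguishing characteristic" pattern could be modeled roughly as follows. This is a sketch under assumptions: the `Segment` class, its field names, and the sample items are all hypothetical, chosen only to mirror Aristotle's example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Segment:
    name: str
    group: str                            # the broader group (the "genus")
    differentia: Callable[[dict], bool]   # test for the distinguishing characteristic

    def includes(self, item: dict) -> bool:
        # Membership requires belonging to the group AND showing the differentia.
        return item.get("group") == self.group and self.differentia(item)

# Hypothetical data, loosely following Aristotle's formulation.
human = Segment("human", "animal", lambda item: item.get("articulate_speech", False))

person = {"group": "animal", "articulate_speech": True}
sparrow = {"group": "animal", "articulate_speech": False}
statue = {"group": "artifact", "articulate_speech": False}
```

Here `human.includes(person)` is true, while the sparrow fails the differentia and the statue fails the group test.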

One of the most popular marketing segmentations relates to generational differences. In the United States, people born after the Second World War are segmented into 4 groups by age.  Other countries use similar segments, but it is not a universal segmentation so I will focus specifically on US nationals.  A common segmentation (with the exact years sometimes varying slightly) is:

  • Generation W (aka “Boomers”): American nationals born between 1946 and 1964
  • Generation X: American nationals born between 1965 and 1980
  • Generation Y (aka “Millennials”): American nationals born between 1981 and 1996
  • Generation Z: American nationals born since 1997

Such segmentation has the virtue of creating category segments that are comprehensive (no item is without a category) and mutually exclusive (no item belongs to more than one category).  It’s clean, though it is not necessarily correct — in the sense that the categories identify what most matters.  
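Whether a segmentation is comprehensive and mutually exclusive can be checked mechanically. The sketch below uses the year ranges listed above; the function names are hypothetical, and the open-ended final range is an assumption that follows from "since 1997".

```python
# Year ranges from the list above; None marks an open-ended upper bound.
GENERATIONS = [
    ("Generation W (Boomers)", 1946, 1964),
    ("Generation X", 1965, 1980),
    ("Generation Y (Millennials)", 1981, 1996),
    ("Generation Z", 1997, None),
]

def segments_for(birth_year):
    """Return every segment whose range contains the birth year."""
    return [
        name
        for name, start, end in GENERATIONS
        if birth_year >= start and (end is None or birth_year <= end)
    ]

def is_clean(first_year, last_year):
    """True if every year in the span falls into exactly one segment."""
    return all(
        len(segments_for(year)) == 1
        for year in range(first_year, last_year + 1)
    )
```

A year matching no segment would show a gap in coverage, and a year matching two would show an overlap. Passing this check says nothing about whether the segments capture what matters most.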

Segments won’t be valuable if the distinctions on which they are based aren’t that important.  A segment could comprise things with a common characteristic that are otherwise quite diverse.  It’s possible for a segment to be designed around an incidental characteristic that makes different things seem similar.

The point of differentia is to represent a defining characteristic. Differentia is valuable when it helps us think through which distinctions matter and are valid.  For example, we might segment people by eye color.  But that hardly seems an important way to segment people. Such segmentation encourages us to refine the group we are segmenting.  Eye color is of interest to makers of tinted contact lenses.  But even then, eye color is not a defining characteristic of a potential contact lens customer, even if it is a relevant one.

While differentia can be hard to define durably, it can play a useful role in taxonomies.  It seems reasonable to segment aircraft according to the number of passengers they carry, for example.  It can capture one key aspect that represents many important issues.

Interoperability: Distinctions within commonality

A related issue is deciding when things are similar enough to say they are the same, and when we can say they are related but different.

Our final perspective comes from nature. The similarity of species is partly defined by their ability to mate.  Some closely related species of birds, for example, will cross breed.  Other pairs of less similar species lack that ability.  

A similar situation exists with languages.  Where are the distinctions and boundaries between similar languages? And when are differences just dialects and not actually different languages?  In language, mutual intelligibility plays a role.  (Language also involves convergence and divergence, but we’ll consider interoperability here.)

The presence or absence of connection between distinct things is associated with two overlapping but distinct concepts: 

  1. Interoperability 
  2. Substitution

Both these concepts address ways in which distinct things might be considered the “same.”

Interoperability is most often associated with technology, though it can also be applied to other areas, such as cultural norms and religions.  The presence of interoperability (the ability of distinct things to connect together easily because they follow a common standard or code of operation) is an indication of their similarity.  If things interoperate, meaning they require no change in setup to work together, then they belong to the same “family,” even if the things come from different sources. The absence of interoperability is a sign that these things may not belong together and need to be split.

Being part of the same family does not imply they are the same.   Any distinctions would relate to the role of each thing in the family (same family, different roles).   Things that follow the same standard may be similar (same role), or they may be complements (different roles).  

If things can be substituted (they are interchangeable but require a different setup to use), they may belong to the same category, but that category may need to be broken down further.  Windows, Linux and macOS computers can be substituted for one another because they serve the same role, so they belong to the broader personal computer category (same role, different families).  But they are separate categories because they don’t interoperate.
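The family and role distinctions above can be expressed as a small decision rule. This is a sketch under stated assumptions: `Thing`, `relationship`, and the "Windows application" example are hypothetical, and reducing a family to a single label simplifies how real standards work.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thing:
    name: str
    family: str  # the standard or platform it follows (simplified to one label)
    role: str    # what it is used for

def relationship(a: Thing, b: Thing) -> str:
    """Classify a pair by shared family (interoperability) and shared role."""
    if a.family == b.family:
        # Same family: the two work together without a change in setup.
        if a.role == b.role:
            return "interoperable, similar"
        return "interoperable, complementary"
    if a.role == b.role:
        # Same role, different families: interchangeable after a different setup.
        return "substitutable"
    return "unrelated"

windows_pc = Thing("Windows PC", family="Windows", role="personal computer")
linux_pc = Thing("Linux PC", family="Linux", role="personal computer")
windows_app = Thing("Windows application", family="Windows", role="application")
```

Under these assumptions, the Windows and Linux computers come out as substitutes that need to be split into separate categories, while the Windows computer and the Windows application are complements within one family.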

The value of taxonomies

Defining taxonomies is not easy.  Interpretation is needed to spot the differences that make a difference. We can improve the discovery process by using heuristic perspectives for lumping and splitting. 

Taxonomy is valuable because it can provide a succinct way to express the significance of an entity in relation to other entities.  Sometimes we need a quick summary to boil down the essence of a thing: what’s distinctive about it, so we can see how it relates to a given situation.  Taxonomies help us overcome the fragmentation of information.

— Michael Andrews

Categories
Content Experience

Three Perspectives on Content Identity

If you put two things together side-by-side, what do they have in common? The answer depends on the point of view.  Alternative viewpoints mold content identity differently. Designers of content experiences, such as content strategists and information architects, can use these viewpoints to surface different kinds of content relationships.

Three actors shape the identity of content: the author or curator; the audience; and the thing or things discussed in the content. Each brings its own perspective to what content is about:

  • Content identity as interpreted by an author or curator
  • Content identity as interpreted by the audience
  • Content about things that reveal dimensions of themselves

Each perspective plays a different role in framing the content experience.

Scene setting: the Curatorial Perspective

Scene setting lets people understand common themes in content that aren’t obvious. An author or curator draws on their unique knowledge to construct a theme that unifies different content items. Such themes set expectations about the relationship of content to other content. It is didactic in orientation.

A common label used to announce a theme is the series — for instance, a TV series, or a narrative trilogy. Sometimes the series is just a way to divide up something into smaller parts, but keep them connected: an article becomes a two-part article.  A content series can express how different items are related according to the intentions of the author or the interpretation of a curator. They can be a sequence of items presented on a common theme. The series may present the evolution of the item over time, such as versions. A building architect might show a series of images starting with a sketch, then a foam model, and finally a photo of the finished building.

A series presents a collection of items and shows how they belong together.  The author/curator draws on their intimate knowledge of the content to point out connections between different content items, which may not be self-evident. We find this in the museum world: an item presented is said to originally belong with other items that have since been dispersed. A curator might indicate how several items embody a common theme, such as when similar paintings express a recurrent motif.

Art curators identify series of related Van Gogh paintings (via Wikipedia). These three are more similar than others he painted on the same subject.

Any time items are defined by the values and judgments of the author (or curator), the audience must be willing to accept that valuation as relevant.  So if a curator identifies items as “new and notable,” then the intended audience needs to buy that labeling.

Mirroring: the Audience Perspective

When mirroring, content reflects themes as seen by the audience.  It represents concepts the way audiences think about them to support attraction to the content.  Mirroring is different from the authorial perspective, which expresses the content’s intention.  The audience perspective expresses how content is imagined.

Brand names are perhaps the purest example of imagined content.  Brands have no intrinsic identity: they depend entirely on the perceptions of customers to define what they mean.  Even a conglomerate that sells many brand products can’t dictate how consumers view these brands.  The French brand house LVMH, which sells numerous luxury brand products, can’t control whether consumers consider Dior more similar to Givenchy or to Louis Vuitton, even though it owns all three brands. In reality, Chinese consumers may have different opinions about these relationships than Italian consumers would.

Part of a dendrogram showing perceived similarities between different luxury brands, from a study at Woosuk University in Korea

High-level concepts that are meaningful to audiences should reflect how audiences perceive them. For example, people associate different kinds of experiences with different vacation activities. Is bungee jumping active-fun, adventurous, or extreme? It is best to work with the audiences’ framework of values, rather than trying to impose one on them. Card sorting is useful for eliciting subjective perceptions about the identity of things.  Yet card sorting is less reliable when defining the identity of concrete things, since it shifts the attention away from the object’s specific properties. Better, more empirical approaches are available to classify concrete items.
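One common way to summarize card sorting results is to count how often participants place two cards in the same pile. The sketch below assumes a hypothetical data shape (each participant's sort is a list of piles, each pile a set of card names), and the activities and results are invented purely for illustration.

```python
from collections import Counter
from itertools import combinations

def co_occurrence(sorts):
    """Count, for each pair of cards, how many participants piled them together."""
    counts = Counter()
    for piles in sorts:
        for pile in piles:
            # Sort names so each pair is always recorded in the same order.
            for a, b in combinations(sorted(pile), 2):
                counts[(a, b)] += 1
    return counts

# Invented results from three hypothetical participants.
SORTS = [
    [{"bungee jumping", "skydiving"}, {"hiking", "kayaking"}],
    [{"bungee jumping", "skydiving", "kayaking"}, {"hiking"}],
    [{"bungee jumping", "skydiving"}, {"hiking", "kayaking"}],
]

pairs = co_occurrence(SORTS)
```

Pairs with high counts are candidates for lumping into one audience-facing category. Pairs that participants rarely group together suggest a split.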

Discovery: Perspectives based on Item Properties

Features of items can suggest themes. Object-defined themes let the things featured in the content speak for themselves. This involves more showing, and less telling. Properties can define identities, and reveal commonalities between different items. This approach promotes discovery of content relationships.

Faceted search interfaces, such as those found on e-commerce sites, are the most familiar implementation of property-driven identification. People choose values for various facets (properties) of items, and get a list of items matching these values.  Using properties to identify items is especially valuable for non-text content. Some Digital Asset Management systems allow people to find images that match a certain shade of a color, regardless of what the subject of the image is.  Properties can identify similarities and relationships that might not be expected from a higher level label.   They can support more criteria-based consideration of identity.  For example, when we think of travel items (things to pack) we generally have standard things in mind: toiletries, articles of clothing, etc.  But if we start with properties, the universe of travel items expands.  We might define travel items as things that are both small and lightweight.  We discover small and lightweight versions of things we might not ordinarily pack for travel, but might enjoy having once we become aware of the option.
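Property-driven identification can be sketched as a simple facet filter. The catalog entries, facet names, and the `facet_filter` helper below are all hypothetical. Real faceted search systems are considerably more elaborate.

```python
def facet_filter(items, **selected):
    """Return the items whose properties match every selected facet value."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]

# Invented catalog; "category" is the higher-level label, the rest are properties.
CATALOG = [
    {"name": "folding kettle", "category": "kitchen", "size": "small", "weight": "light"},
    {"name": "travel toothbrush", "category": "toiletries", "size": "small", "weight": "light"},
    {"name": "cast-iron pan", "category": "kitchen", "size": "large", "weight": "heavy"},
]

# Define "travel items" by properties rather than by category label.
travel_items = facet_filter(CATALOG, size="small", weight="light")
travel_names = [item["name"] for item in travel_items]
```

Filtering on size and weight surfaces the folding kettle alongside the toothbrush, even though its category label would not suggest a travel item.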

Generative classification of objects according to properties by P Harni, screenshot via Aalto University

Leveraging Diverse Viewpoints

There’s more than one way to define the relationship between items of content. I sometimes see people try to make a single hierarchical taxonomy serve as both an authoritative or objective classification of content, and a user-centric classification that reflects the subjective perceptions of users, without realizing they are forcing together different kinds of content identities — one relatively stable, the other contextual and subject to change.

Content can be considered objectively as it is; authoritatively as it is intended; and subjectively as it seems to various audiences. These differences offer thematic lenses for looking at content. They can be used to help audiences connect different items of content together in different ways: setting the scene for audiences so they understand relationships better, reflecting their existing attitudes to promote attraction to items of interest, and helping them discover things they didn’t know.

— Michael Andrews