
Revisiting the difference between content and data

This post looks in detail at the differences between content and data.  

TL;DR — I understand you’re too busy to read a 10,000+ word post.  No worries, I’ve pulled out some highlights in the table below.  If this data doesn’t answer your questions, you may have to read further.

Content | Data
Open domain | Closed domain
No restrictions on expression | Expression is restricted
Has an intent | Doesn’t have an intent
Has an author | Often anonymous
Complex values | Single, unambiguous values
Topics and narratives | Entities
Structure builds resources | Structure defines boundaries
Facts discussed outside of pre-defined relationships | Facts described through pre-defined relationships
Assembly structure is coupled | Assembly structure is decoupled
Editorial composition | Logical composition
Has a specific audience | Is universally relevant
Nuanced in meaning | Meaning is standardized
Focused on audience interest | Focused on resource reusability
Meaning can be independent of context | Meaning is dependent on context
Build up meaning | Break down meaning
The bigger theme | The details
Often proprietary | Often public domain
Uniqueness is valued | Regularity is valued
Cares about audience attention | Doesn’t care about audience attention

Some differences between content and data

What’s the issue and why does it matter?

New approaches to managing content and data such as headless content management and knowledge graphs are altering how we work with digital resources on a technical level.  Those developments have made it even more important to address what humans need when it comes to content and data, especially since the needs of people so far have not been at the forefront of these technological developments.  Too often, the resource is considered more valuable than the people who might use the resource.  It’s time to define future approaches to accessing content and data in a more human-centered way.   The first step is to be clear about the differences between how people use content and data.    

While the difference between content and data may seem like an idle philosophical question, it goes to the heart of how we conceptualize and imagine our use of resources online. Probing the distinction allows us to examine our sometimes unconscious assumptions about the value of different resources.  Content and data are basic building blocks for communication and understanding.  Yet I’ve been surprised how differently professionals think about their respective value, potential, and limitations.  My thinking about this topic has evolved as well, as I watch changes in the possibilities for working with both but also encounter sometimes idealized notions about what they can accomplish.  

Our mental model of digital resources influences how we work with and value them.  Many people consider content and data as different things, even if they can’t delineate precisely how they differ.  They often have different perspectives on the value of each.  Some popular tropes illustrate this.  Data is the “new oil” — the value of content is to generate or extract data.  Content is “king” (or queen) – data exists to support content.   In both these perspectives, content or data are seen as raw material — a means to an end — though they disagree on what the end is.

When we talk about content and data, we rarely define what these terms mean. This situation has been true since the early days of the web.  We’ve made little progress in understanding what various digital resources mean to people and the picture keeps getting more complex.  Content and data are becoming more intertwined, but they aren’t necessarily converging.  

On a technical level, various mental models people have about resources get translated into formal schemas.  How we architect resources influences how people can use them. 

Are content and data different categories of resources?

Professionals who work with digital resources in different roles don’t share the same understanding of how content relates to data.  Until recently that wasn’t a big problem, because content and data lived in separate silos.  People who worked with content could comfortably ignore the details of data, and those with a data focus were indifferent to content.  

Outside of those who create and manage content, content is still largely ignored in the IT world.  There’s always been more of a focus on information or knowledge.  When you avoid discussing content and understanding its role, you can slip into a shaky discussion about delivering information or providing “knowledge” to users without considering what audiences actually need.  

But the historical silos between content and data are slowly coming down, and it’s becoming more important to understand how content and data are related.  Unfortunately, it’s not so simple to express their differences succinctly, because they vary in many different dimensions.  They aren’t just words with simple definitions, but complex concepts.

Everyday definitions of content and data don’t help us much in locating critical distinctions.  For example, Princeton University’s Wordnet lexical database provides the following short definitions of each:

  • Content: message, subject matter, substance (what a communication that is about something is about)
  • Data: information (a collection of facts from which conclusions may be drawn) “statistical data”

These definitions hint at differences, but also areas of overlap.  The distinction between a “message” and “information” is not obvious.  In everyday speech, we might say “Did you receive the message (alternatively: information) today?”  Similarly, the ideas of “substance” and “facts” seem similar.  It’s tempting to dismiss these problems as the byproduct of sloppy definitions, but I believe many of the difficulties stem from the complex and changing essence of these concepts.  

When concepts are difficult to define, many people look for examples to show what something means.  But canonical examples of familiar resources don’t help us much.  Let’s consider two ink-and-paper products that can be purchased from the US Government Printing Office.  The US Census is a canonical example of data: a compilation of cold, statistical facts.  The US Constitution is a canonical example of content — a series of statements that are rich with meaning — so much so that people passionately debate what every word means. Canonical examples can provide concrete illustrations of concepts, but most of the resources we deal with will not be so well defined.  

In the past, we categorized resources according to end user: data was for machines to process and calculate, while content was for humans to understand. Or we conceptualized content and data by the environments in which we encountered them.  For example, we viewed data as rows and columns of text and numbers in a spreadsheet or relational database.  While these could be presented to readers in a table, the rawness of the source material did not make it seem like content.  As computers began to process all our resources, the picture became more complicated.  Computers could store records, such as my university transcript, which provided an overview of my studies — a story of a sort.  Computers also stored documents.  I began using computers to create documents while in university, even though the delivered document was on paper.  Was the file on that floppy disk content or data? Conversely, I used card stock paper to create data for computers, by filling in the dots on a computer punch card — manually processing data for the benefit of a computer.

Based on past experience, people are often inclined to see content and data as different.  It’s also worth considering how they might be related. Earlier this month, a developer I know posted on a content strategy forum that “content is data.” Professionals working with digital resources may view content and data as having a range of relationships:

  1. They are identical (or any avowed distinctions are inconsequential)
  2. They are separate and independent from each other, with no overlap
  3. They overlap in some aspects (either sharing common properties or representing a continuum)
  4. One is a subset of the other (content is a kind of data or data is a kind of content)

I’ve encountered people who promote each of these views, among others.  It’s even possible to see data or content as an expendable resource with no inherent value.  For researchers in natural language processing, content is merely a “data set” — a very long string of characters to bend into shape to support different use cases.

My own view is that content and data are fundamentally different, with only limited overlap. They coexist and complement each other, but they are distinct kinds of resources. In one specific dimension, they are becoming more alike: how they are stored and can be managed.  But they are very different in two other areas: how they are generated and created, and how they are consumed and used.

When viewed solely through the lens of technology, content and data can seem to resemble each other.  Both can be structured by models that are similar in form. Much of what makes content and data different is invisible to technology: they have different purposes, and people relate to them in different ways.  A growing source of confusion arises when individuals assume that content and data should behave the same way because they have similarities in form.  But morphological similarities are not the full story, just as the wings of pigeons, penguins, and ostriches look similar but play different roles.  

How content and data are stored and managed

Digital resources have a material presence. They take up space.  I live only a few miles from where one of the largest concentrations of server farms in the US is hosted and am keenly aware of the land and energy they need.  What’s lurking there?  What are these resources talking about?

Many developers see no intrinsic difference between data and content: both are simply digital objects with IDs.  Different models of storage — branching code repositories, file structures, XML encodings, graph databases, schema-less databases — are simply alternative ways to organize bytes of data.   Some developers refer to content as unstructured data.  According to that perspective, content is a form of data, but in a less perfect form.  The best that content can aspire to is to become “semistructured” data.

Developers encounter the terms content and data in jargon referring to the format of the resource they are dealing with.  They deal with “content types” (for HTTP requests) and “data types” (for data values stored or retrieved).  For example, text can be plain or HTML.  In this sense, content and data aren’t too different: they define a discrete payload.  Small wonder developers don’t spend much time pondering the distinctions.

Within a content management system, the term “content types” appears again but in a different sense: they define the fields for an item of content, and each field needs a data type to indicate the kind of value used.  Here, content types are made from data types and might be considered a superset of data types.
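
To make that concrete, here is a minimal sketch in Python, with hypothetical field names rather than any particular CMS’s API, of a content type whose fields are each backed by a primitive data type:

```python
# A minimal sketch of a CMS-style content type (hypothetical field names,
# not any particular CMS's schema). Each content-type field is backed by
# a primitive data type.

article_content_type = {
    "name": "Article",
    "fields": [
        {"name": "title",       "dataType": "string",   "required": True},
        {"name": "body",        "dataType": "richText", "required": True},
        {"name": "publishDate", "dataType": "date",     "required": False},
        {"name": "heroImage",   "dataType": "assetId",  "required": False},
    ],
}

def missing_required_fields(item: dict, content_type: dict) -> list[str]:
    """Report required fields missing from a content item."""
    return [
        f["name"]
        for f in content_type["fields"]
        if f["required"] and f["name"] not in item
    ]

print(missing_required_fields({"title": "Content vs. data"}, article_content_type))
# ['body']
```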

In the CMS-specific definition of content types, we see signs of semantics, which break the resource into parts that have names.  The semantics help describe what the resource is about, not just how to process it as a format.  Provided a developer knows the structure of the resource, they can query it to obtain specific elements within the resource to answer questions.

What do the parts of a resource offer and how can they be used?  These questions are often answered in API documentation, which explains the model of the resource.  Most serious CMSs now have a content API; better ones expose their entire content model. This is having radical consequences.  Content is not tied to any specific display destination such as a website, but instead becomes a dynamic resource.  With GraphQL, a fast-growing API query standard, it’s become easy to combine content from different sources.  Content publishing has the potential to become multisource: distributed and federated.  Content can be exchanged between different sources, which don’t need to have precisely the same understanding of what the originating source had in mind.  APIs are like a phrasebook that translates between different parties.  People don’t need to speak the identical language (schema) provided that they have the phrasebook.  This is a major shift from the past when the presumption had been that everyone needed to agree to a common schema to share their resources.
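
As a rough sketch of what this looks like in practice, the snippet below sends a GraphQL query to a hypothetical content API endpoint; the URL, query fields, and schema are invented for illustration:

```python
# A minimal sketch of querying a content API with GraphQL.
# The endpoint and field names are hypothetical; real content APIs
# expose their own schemas.
import json
import urllib.request

QUERY = """
query ArticleWithAuthor($slug: String!) {
  article(slug: $slug) {
    title
    body
    author {        # could be resolved from a second source behind the API
      name
      bio
    }
  }
}
"""

def fetch_article(slug: str) -> dict:
    payload = json.dumps({"query": QUERY, "variables": {"slug": slug}}).encode()
    req = urllib.request.Request(
        "https://example.com/graphql",          # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```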

Non-technical users gain access to the model of a resource by using a UI that’s connected to it.  With the shift to the cloud, consumers are increasingly unconcerned about where resources are stored.  Many resources are accessed through apps.  Some browsers now don’t display full URLs.  The address of the content — its path and location — is disappearing and along with it a concern about structure.  While the cloud is just a metaphor, it does capture the reality that resources no longer have a fixed address.  Storage containerization means resources move around.  From the consumer’s perspective, the structure is becoming invisible — seamless.  They don’t see or care about how the sausage is made.  They only care what it tastes like.

In terms of access and storage, content and data are becoming more alike.  With APIs, content is becoming more malleable, like data.  And data is getting upgraded with more semantics, becoming more descriptive and content-like. These changes are largely invisible to the consumer, until they don’t work out.  People do notice when things are askew, though they are unsure why they are.  They care not only whether the files can be accessed, but also how coherent the experience is of getting that stream of resources.

Generating and creating resources: expression

What can a digital resource talk about?  Content and data often discuss different concerns.  Content talks about topics or stories.  Data talks about entities.  Though their subjects can overlap, content and data have different expressive potentials.

Data and content have different perspectives about: 

  • What they can mention: the properties that are discussed
  • What can be said about those things: the values presented

Restricting expression

The structuring of resources influences what those resources can discuss.  Structure can also impose rules on values (controlling, validating, or restricting them), which further limits what can be said.  In short, structure can limit expression.

Content is “open domain”: it can talk about anything.  The author decides what topic or story to talk about — they are not restricted to a pre-determined universe of topics.  Once that story or topic is chosen, they can address any aspect they want; there are no restrictions on the attributes of the topic.  And they can say anything they want about those attributes; there are no restrictions on their values.

Content, in its untamed form, is open-ended in how it discusses things. Content doesn’t require a pre-defined structure, though it’s possible to structure content to define properties that shape what dimensions get discussed.  These may be specific fields or broader ones.  Even when content is divided into structural elements, the range of values associated with these elements is open-ended (such as free text, video, or photos) and the values can discuss anything without restriction.  Restrictions on content are few: there may be a character limit on a text field or a file size limit on an image.  A few fields will only accept controlled values.  But in general, most content is composed of values created by an author.

 Data is “closed domain”: to be useful, data must be defined with a formal schema of some kind — a set of rules.  This limits what the data can talk about to what the data schema has defined already.  
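
A toy comparison may help. In the sketch below (invented fields and values), the data record is valid only if its values fall within the schema’s enumerated domain, while the content field accepts whatever the author chooses to express:

```python
# A sketch contrasting open and closed domains, with made-up fields.
# The data value must come from the schema's enumerated domain;
# the content value is whatever the author decides to write.

PRODUCT_SCHEMA = {
    "status": {"in_stock", "backordered", "discontinued"},  # closed domain
}

def validate_data(record: dict) -> bool:
    """True only if every value sits inside its allowed domain."""
    return all(record.get(k) in allowed for k, allowed in PRODUCT_SCHEMA.items())

data_record = {"status": "in_stock"}          # must match the schema
content_note = (                              # open domain: the author decides
    "Back in stock after a long wait — and the new colour is worth it."
)

print(validate_data(data_record))             # True
print(validate_data({"status": "sold out"}))  # False: value outside the domain
```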

Digital resources thus vary in their structure and allowed values.  Structure is not inherently good or bad — it involves a range of tradeoffs.  What’s best depends on the intent of the contributor.  We need to consider the goal of the resource.  

With data, there’s no obvious intent.  We don’t know why the data is there, or how it is supposed to be used.  But the presumption is that others will use the data, and to make the data useful, it needs to conform to certain standards relating to its structure and allowed values.

With content, the situation is different.  All content is created with an intent in mind. Sometimes that intent is vague or poorly thought out.  But content generally takes effort to create, which means there must be a motivation behind why it exists.  And by looking at the content, we can often infer its intent.  We know there’s an audience who is expected to view the content and we can make some guesses about what that audience is expecting.  The audience’s needs define both the structure and the values used. 

While data doesn’t have an audience, content does. Data doesn’t have an author, while content does.  These distinctions have implications for how resources can be used.  

Representation and ‘aboutness’ in digital resources

What’s the resource about and what’s it trying to explain?  Again, content and data focus on different aspects.

Content and data differ in what each can represent.  Content can discuss a set of facts that don’t have a predefined relationship.  Authors make decisions about what to include and how to talk about them. Content doesn’t need to be routine in what it expresses. Data is meant to convey a predefined range of facts in a precise way.  Data depends on being routine.    

Both content and data make statements, but the values for these statements are dissimilar.  Content elements hold complex values (the value may contain several ideas.)  Data elements hold simple values (normally one idea or ID per value.)

Content deals with topics and stories that ultimately are about themes, ideas, concepts, life events, and other kinds of things that are open to interpretation.  Data describes entities — concrete things in the real world or human-defined records such as invoices.  Data can only describe properties of things that can be measured according to recognized values.  Its role is to provide a consistent understanding of an entity.

A fundamental difference, then, is the approach that each uses to describe things.  Content describes these with natural language, pictures, or other forms of human communication. Data describes them in terms of their measurable properties, or their relationships to other entities.  

Data values are simple and ideally unambiguous: names, IDs, pre-decided labels, quantities, or dates.  Content is different from those simple values.  A content value can’t be easily evaluated by machines because it contains complex, multipart statements about multiple entities.  Lots of work goes into making natural language understandable to computers — finding the themes and sentiments, or recognizing when entities are mentioned.  Despite impressive progress, machines have trouble interpreting human communication.  That machines don’t find human communication reliable does not mean it’s less accurate.  Content can be both more nuanced and more compact than data.  Content may seem less precise, but it also can represent concepts and statements that data representations can’t hope to.  Data engineers are having difficulty representing even the basic features of laws and regulations, for example.

Data breaks down facts into individual statements.  By doing so, data can manage to be both specific and incomplete.  The building blocks of data allow us to say a lot about entities.  We can map the relationships between various entities, including people.  Data can tell about a couple who marry and later divorce.  But it can’t tell us why they married or why they divorced.  The enumerated values within data models can’t address complex explanations.
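
A small illustration of that limit, using an invented vocabulary of subject–predicate–object statements:

```python
# A sketch of the marriage example as data statements (subject, predicate, object).
# The identifiers and predicates are invented for illustration.
statements = [
    ("person:anna", "marriedTo",    "person:ben"),
    ("person:anna", "marriageDate", "2010-06-12"),
    ("person:anna", "divorcedFrom", "person:ben"),
    ("person:anna", "divorceDate",  "2016-03-02"),
]

# The model can answer "who?" and "when?", but no enumerated predicate
# captures "why they married" or "why they divorced" — that explanation
# lives in content, not in the schema.
who_and_when = [s for s in statements if s[1] in {"marriageDate", "divorceDate"}]
print(who_and_when)
```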

Data’s fundamental purpose is to make discrete, unambiguous statements about an entity.  Data will hypostatize or reify an object. The object becomes an entify-able value: its existence is stripped down into its observable and quantifiable qualities. Once an entity is converted into values, it can be evaluated.  Even actions can be treated as entities, provided they can be enumerated into categories that are fixed in meaning.   This restriction limits the data-ification of the descriptions of processes since actions can involve so much variation.  

The ability to make discrete, unambiguous statements depends on having an agreed schema to discuss an object’s properties.  Data schemas can either be opinionated or not opinionated (an opinion being a point of view that’s not universally accepted.)   The semantics (what things mean) may involve coerced agreement — much like the terms of service you must click on to use an internet service.  How one asks questions (the syntax) can involve forced agreement as well.  Schemas can contain a range of opinions:

  • Opinionated: Everyone needs to agree to the same schema to talk about what things mean (you must accept my version of the truth.)  
  • Non-opinionated: people can define their own schemas, though they will need to learn about what others have decided if they want to use someone else’s.
  • Opinionated: everyone needs to make requests in the same way (syntax) about a set of facts defined by a schema.  
  • Non-opinionated: how one asks questions can vary and the answers (truth) are language-neutral, independent of how questions are framed. 

A deep irony — and flaw — of many efforts to structure data with a standardized schema is that these initiatives tend to dictate the use of a specific query language to access and utilize the data.  “Openness” is promoted in the name of interoperability but is done by forcing everyone to adopt a particular standard for data or queries — forcing an opinion about what’s correct and acceptable, resulting in pseudo-openness.

A schema is simply a framework that provides a context to what is being discussed.  People use mental schema to interpret the world, while machines use data schemas to do that.

Content can have meaning independently of its immediate context, provided the content is unambiguous.  A fragment of content can often stand alone, while a fragment of data can’t.  Content doesn’t need a formal schema: it relies on shared knowledge and shared meaning.  The role of structure within content is to amplify meaning beyond the meaning carried by the words or images.  The structure of content provides a scaffolding of meaning.  This represents a big difference in how content and data approach structure.  With content, the emphasis is on using structure to build up meaning: to enlarge ideas.  With data, the emphasis is on using structure to break down meaning: to locate specific details.

Data has meaning only within a specific context. When viewed by humans, data makes sense only as part of a record that shows the context for the values presented.   For machines, data makes sense within the hierarchical or lateral relationships defined by a schema.  The innovation of semantic data is its ability to describe the meaning of data independently of a larger record.  Semantic data creates the possibility to recontextualize data.  
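
For example, a few self-describing RDF statements can be expressed with the rdflib library; the example.org identifiers are made up and the properties are borrowed from schema.org. Each triple carries its context with it, so it can be recombined with triples from other sources without a shared record layout:

```python
# A minimal sketch of semantic (RDF) data using rdflib.
# The example.org URIs are invented; the properties come from schema.org.
from rdflib import Graph, Literal, Namespace, URIRef

SDO = Namespace("https://schema.org/")
g = Graph()
book = URIRef("https://example.org/book/1984")

# Each triple is a self-contained statement: subject, predicate, object.
g.add((book, SDO.name, Literal("Nineteen Eighty-Four")))
g.add((book, SDO.author, URIRef("https://example.org/person/orwell")))
g.add((book, SDO.datePublished, Literal("1949-06-08")))

print(g.serialize(format="turtle"))
```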

Data can supply precise answers to structured questions. But data does poorly when trying to convey the meaning of concepts — the big ideas that motivate and direct our behavior.  Complex concepts are difficult to express as data.  They can be given identifiers to disambiguate them from similar words that refer to different concepts.  But even a universal ID, such as a Wikidata ID, does not make the concept of “love” clear as to its meaning.  It merely tells us we aren’t talking about a band with such a name.

Data can supply a comprehensive description only when an entity can be defined entirely by data. For example, statistical categories can be defined by data.   More often, data must rely on concepts that can only be defined using content.  Even if different resources use the same term to describe a property (such as “name”), they may define those terms differently.   

Combining details to build explanations

Another aspect of expression concerns what can be presented together to create a larger whole.  Here, we aren’t looking at what can be said by a single contributor at a single point in time, but what can be combined from different sources created at different times.

Both content and data are becoming more connected: able to combine with other similar resources.   The elements within both kinds of resources are being defined more specifically with semantics indicating their meaning or purpose.  This allows larger digital resources to be composed semi-automatically, a sort of ghost authorship.

When resources are broken into elements, they can be combined into various combinations to create new resources.  On what basis are combinations made?  It’s useful to distinguish logical composition from editorial composition.  Data is concerned with logical composition, while content is concerned with editorial composition.  

Web content historically hasn’t been structured into pieces that could be separated. All content intended for publication would be created at the same time by the same author within a single body field.  The elements within the content were tightly coupled, part of a common template.   Authors enjoyed great freedom about what they could address within an item of content but had less freedom in how their content could be combined with other content they or others created.  The presentation of a web page was fixed.

Content can be composed by stitching together statements, a process done through human curation (via links) or programmatically (by filtering and gathering similar items.)  In both cases, humans decide the rules for what kinds of things belong together to compose meaningful experiences.  The difference is that hard-coded programmatic rules are applied routinely rather than once.  

Data has always been about decoupling elements to allow them to be presented in different ways.  What the data can say might be restricted, but how it can be presented can vary in many ways.  Intrinsically, data has the flexibility to combine with other data.  It can either be assembled on a one-time basis from a situational query or be generated routinely from a saved query.

How useful are larger resources that are assembled from smaller elements?  Not all assembled resources are equally useful.

 It is easier to break apart content into data than it is to compose content from data.  Designing a successful binding agent that turns data into content is difficult. Automated journalism can generate prose from data, but the resulting content comes across as pro forma and far from engaging.

A key difference in the structure of content and data relates to intent.  Data structure specifies the meaning of a value.  Content structure will often indicate the intent of an element as well as the meaning it conveys. For example, an author has a name, which is a data value.  The author may have a bio, which is a content value.  The bio is intended to provide context about the writer, and perhaps spark interest in what they have to say.  The bio presents some facts about the author, but its purpose is broader.

Data elements don’t have predefined purposes.  This allows them to be displayed in any number of combinations.  Content elements are less independent in how they can be displayed.  When content presents a topic, patterns of elements tend to be grouped together, and hierarchies of elements address broader and more specific aspects.  

To build an explanation, the elements of a resource need to work together to provide a unified understanding of a larger topic.  Content management has moved from a tightly-coupled structure to a loosely-coupled one, especially with the development of headless content models.  Elements can now be remixed into new presentations.  But audiences must perceive these elements as belonging together.  Loose-coupling is done by relating content types: collections of editorially-compatible elements.   

The simplest form of relations defines whether something is “part of” another or is “kind of” another.  Both these relationships enable aggregation of elements discussing smaller entities into discussions of larger ones. 

Both content and data are becoming explicit in indicating what individual elements mean. Despite that similarity, they remain different.  People confuse the concepts of structured content and structured data, yet these concepts are different in their orientation.  Both can be building blocks to generate more sophisticated resources.  But the materiality of those blocks is different.  

The structure of content reflects editorial intent.  That intent is often specific to the publisher, which is one reason why standardizations of content structure across publishers have not materialized. The area of technical documentation is a notable exception: it has tried to standardize the expression of content, with mixed results.  Some transactional content is indeed amenable to standardization: cases where people don’t want to think at all because they don’t have an opinion about the material, they trust the advice, and they don’t worry about the consequences of the advice.  API documentation is an example of documentation where the structure used by different publishers is converging, because of the high degree of similarity in its functional purpose.   

But the trend now seems to move away from making content seem like an interchangeable commodity. Content needs to sound human. The rising prominence of intention-focused approaches such as content design and UX writing reflects a growing public wariness toward cookie-cutter content for even “dry” technical and instructional topics.  Audiences don’t trust content that seems formulaic, because it sounds repetitive — even robotic. When all content follows the same limited patterns, it looks the same and people have trouble noticing what is different about each piece.  They question predefined answers that seem too tidy and lack background explanation.  Content needs to sound conversational if it hopes to garner attention — that’s true even for technical, factually-oriented content.  People hesitate if they feel they’re being blinded by details — snowed.  Every detail should seem necessary to the larger purpose of what they are seeing.

Editorial intent is about providing coherence to audiences.  The structure of content needs to support coherence.  It’s the opposite of the data-centric approach of the “mash-up”:  a jarring experience involving a mishmash of items that were never meant to be presented together.  

To assemble content elements successfully into a whole, the pieces need to be designed so they fit together. The pieces should support one another to provide a richer explanation than they would if they were viewed individually.  Generic standards for content elements have never seemed coherent: they have prioritized splitting things apart rather than gluing things together.  Most of them focus on factually heavy content that few people would want to read in total.  It’s possible to combine content from different sources, but the editorial intent of each element needs to fit together. The structure should be seamlessly supporting the larger message.  It shouldn’t be fragmenting the topic to where the relationship among elements is not reinforced.

Public expectations for data are different.  For data to be considered reliable, people expect it to look “regular.”  Consumers don’t want to see an asterisk appended to data, a footnote explaining some nuance.  Data is less trustworthy when posing as solid facts but presented tentatively or teasingly.   It also becomes less trustworthy when it is presented in inconsistent ways.  We expect data to be predictable and wonder what’s being hidden if it is presented in an unexpected manner. Predictable routines are more effective for data.  The regularity in data makes it seem more trustworthy and reliable.  People expect data to offer accuracy, not broader meaning.

While content and data are different, there’s still a big push to treat content as if it were data.  We’ve seen how earlier attempts to do this, such as mash-ups, were widely unsuccessful because they were incoherent and fragmented the experience for audiences.  More recently, data enthusiasts have promoted the application of semantics to eliminate differences in how publishers describe and use their content.  It’s important to address the potential and limitations of semantics.

Digital resources rely on semantics. But in the view of some, these resources don’t rely on semantics enough.  Some argue that all digital resources can and should be described with a common model: the RDF data model.   

The RDF data model represents the vision of what the inventor of the World Wide Web, Sir Tim Berners-Lee, thought the web should become: a connected body of commonly described data, or linked data. I’ve seen efforts to use the RDF data model to publish content rather than data.   But I’ve never seen an RDF data model that reflects editorial priorities, structuring content the way audiences expect it.  RDF works well for data, but when used for content, it delivers dull, fact-based pages that at best resemble an encyclopedia. An interesting art museum collection gets transformed into a lifeless database dangling with confusing categories and options. These efforts reflect a belief that details are more important than narratives.

Wikipedia, the most popular encyclopedia on the web, starts with content, rather than data. Wikipedia offers templates to provide an editorial framework for the content, which enables structural uniformity but doesn’t impose it.  Wikipedia is one of the most advanced experiments attempting to harmonize content with a data model, but it highlights the limitations that even a single publisher can have doing that.  It’s possible to extract data from Wikipedia but the complexity of its content has defied efforts to normalize it into a predictable data model.  

How resources are consumed and used

The flip side of expression is interpretation.  The meaning of a resource is not only a function of how it’s represented — the semantics.  Its meaning depends on how it’s received and evaluated — what humans and machines notice and understand.  

 The possibilities for machine-assisted interpretation are growing as data becomes richer, especially when it is highly curated.  People consume data — the fields of data visualization and data storytelling, for example, have exploded in the past decade, and these human-centered experiences can be machine-generated.  And we can no longer assume that stuff traditionally considered content — narrative text, for example — is never consumed by machines.  The range of what machines can do keeps growing, but they remain fundamentally different from humans in what they aim to accomplish.  

Who’s the resource for?

Let’s look at who needs or will want to use a resource. Does everyone need it, or only some people?  Many debates about the importance of standards used to describe resources arise because of different presumptions about who needs them.  Often the audience for the resource is never clearly defined.

Those who advocate for standards believe that everyone needs a shared basis of understanding: a common schema. A related belief is that everyone needs to know the same kinds of things from the resource, so using one common standard will be adequate for everyone.  Those who don’t embrace common standards consider them as extra work, often getting in the way of what’s needed to provide a complete explanation.    

If everyone agrees about the semantics — what to describe and what that signifies —  it can amplify the understanding and the potential utility of resources.  But reaching a universal agreement is difficult to achieve because people in practice want different things from resources.  They have different goals, preferences, priorities, and beliefs about the utility of various resources. What gets represented in standards is a reflection of what the standards makers want to be consumed.  Some people’s priorities don’t fit within what the standards committee is interested in supporting.  And when resources rely on semantic conventions outside of the agreed mainstream, it can potentially limit how those resources are acknowledged by IT systems that rely on these mainstream standards.  Standards are a closed system: they encourage connections within the universe of standards-adopters while indifferently ignoring outsiders.  Common standards can promote shared understanding. But they can also throttle expression by limiting the scope of what can be said.  

Data needs a schema to indicate how different bits of information are related to one another — otherwise, the data is not intelligible. Many data schemas are self-defined to reflect the requirements of the publisher.  Some data only moves around within a single publisher.  But because data is commonly exchanged between different parties, data will often utilize a common schema of some sort.  Externally-defined standards support exchanging resources with other parties. But it’s a mistake to see them as frictionless and cost-free.  

Digital resources that adhere to “open standards” present themselves as universally usable and available.  But in practice, they can become a walled garden that’s not accessible to all, due to stealth technical hurdles or barriers.  To understand this apparent paradox, one first needs to step back and ask how inclusive the standard’s development and adoption are. Standards are created by committees of often like-minded people who have specific agendas about what they want the standard to do.  If a narrow group was responsible for developing a standard, it may look like a consensus but it may never have achieved broad interest and wide adoption.  This can happen when standards are opinionated — they seem to work great, as long as one’s willing to sign on to the concepts, presumptions, and limitations of the standard.  And every standard has these constraints, though few are keen to advertise them.  Practices that are looser in their demands tend to be more flexible (and practical) and hence more widely adopted, and in the process become de-facto standards that are more impactful than many official ones.  

Data relies on APIs to broker what’s needed by different parties.  When people use apps that must access data from various sources, machines need to exchange data, and APIs indicate how to do that.  Frequently, only select data needs to be available to a limited number of other parties. In other words, everyone does not need everything. A lot of data is personal to an individual: your streaming music playlist, your car service history, or your vacation plans.  While some people want to broadcast their personal data, many people are reluctant to do that, concerned about their personal privacy and cybersecurity.

Most data is exchanged using custom APIs, where the provider of data describes its schema. Most data does not conform to a universal standard such as the RDF data model, and only a minority of IT systems support RDF.  

Yet some data is considered public and is made universally available.  Data described with RDF can connect easily to other data that are  described with that model.  RDF is most commonly used for non-proprietary data since there’s an incentive to connect such data widely.  

The audience for RDF-defined data is unique in several respects.  RDF schemas tend to be designed by technical people for other technical people: engineers, scientists, or data analysts. The people creating the schema are largely the same people using the data described by the schema. They share a mental model of what’s important. The data that’s targeted is presumed to be a public good: it should be available to everyone, who can access it by using a commonly adopted schema.  Semantic standards assume that data must have a universal identifier because anyone might need to use the data in its raw form.  This sentiment is most expressed in the notion of “open data”: that data must be downloadable and reusable by anyone — you agree to surrender rights to how what you create is used.  The “authorship” of data, unlike content, is often anonymous.  Data is easily replicated, becoming widely available.  It can quickly become public domain, where it’s not unique enough to merit copyright.

Unlike data, content is neither private nor public.  It’s meant to be widely used, but normally it’s neither individually focused nor universally needed.  Not everyone will want or potentially need the same content.  They may want content from specific sources, since the source of the content is part of the context audiences consider when evaluating it. Content is generally proprietary: it isn’t meant to mix freely with content from other sources.  Content normally has an identifiable individual or institutional author, unlike the case with much data.  To have value, content should be unique, and by extension be subject to copyright.

For a scientist, any data is potentially useful because you never know ahead of time what facts might be connected to others.  For a teenager viewing a smartphone screen, only some content will ever be of interest.  Content, more than data, is situationally relevant; the requirement of having a universal identifier is not as great.  Content needs an identifier of some sort, but since not everyone in the world needs to be able to access all content, every piece of content doesn’t need a globally unique ID.  Most people only need a way to access the assembled content, such as a web page.  They don’t need access to the individual elements that comprise that web page. That task is delegated to an API, which worries about interpreting what’s wanted with what’s available.

Government-funded organizations often have an obligation to describe everything they publish using an open standard. Public funding of data tends to drive data openness.  But the data itself may be of interest to only a small group of people.  Just because data utilizes common standards doesn’t mean there’s universal demand for the data.  Data about the DNA of fruit flies or archeological artifacts from Crete may be of public interest, but it isn’t necessarily of universal interest.

What’s available to use? 

How resources can be used to some extent depends on the relationship between the parts and the whole.  

  • Do the parts have meaning when separated from the whole?  
  • Do different parts when combined resemble a coherent whole?

We can think about digital resources as fulfilling two different scenarios.  First, we have cases where the user is expecting something specific from the resource: the known-knowns.  They have a question to get answered, or a predefined experience they seek, such as listening to a specific music track.  Second, we have cases where the user doesn’t have strong expectations: either known-unknowns or even unknown-knowns.  

The issue is more complex than it first appears because we haven’t fully defined who the user is.  Does the user make their choices directly on their own, or do they rely on an intermediary (a machine or editor) to guide their choice?

What’s the resource for?

Both data and content can answer questions, but the kinds of questions they can answer are different.  Questions vary:

  • What does the user want to know?  
  • Do they have a specific question in mind?  
  • Does the answer provide a complete picture of what they need to know, or does it hide some important qualification or comparison?

This distinction between data and content has practical consequences for the design of APIs that offer answers. What can be queried?  Does the value need a label to explain what it is?  Is the author of an API query a curator of content, or a seeker of knowledge?  A curating intermediary will be inclined to read the API documentation, while the knowledge seeker wants direct answers.

Both people and machines consume resources — they try to make sense of them and act on what they say.  When resources are consumed, the identity of what’s being discussed in the resource can be a vexing issue.  What is it talking about, and is it the right thing of interest?  Data IDs can help us be more precise in specifying and locating items, but they can also be misleading.  Both machines and humans presume that identical strings of characters indicate equivalent items.  In the case of humans, this is most true when the string is uncommon and assumed to refer to something unique.  Machines are normally less discriminating about how common something is: they are greedy in making inferences unless programmed not to.  Machines assume every matching set of strings refers to the same thing.  
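
A toy example of that greediness: a naive join keyed on the surface string lumps together every mention of “Mercury,” whatever it refers to.

```python
# A toy illustration: naive matching treats identical strings as identical things.
from collections import defaultdict

mentions = [
    {"doc": "chemistry-article", "term": "Mercury"},  # the element
    {"doc": "astronomy-post",    "term": "Mercury"},  # the planet
    {"doc": "mythology-page",    "term": "Mercury"},  # the Roman god
]

# A greedy join keyed on the surface string lumps all three together,
# even though a human reader would separate them by context.
index = defaultdict(list)
for m in mentions:
    index[m["term"]].append(m["doc"])

print(index["Mercury"])  # ['chemistry-article', 'astronomy-post', 'mythology-page']
```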

Of course, just because different items have the same string label or ID, that doesn’t mean they are referring to the same idea.  The hashtags in folksonomies illustrate this problem.  Catchy words or phrases can become popular but have diverging meanings.  

Humans have trouble when talking about concepts, because the same word or phrase may imply different things to different people, what’s known as polysemy.  Nearly everyone has their own idea about what love means, even while nearly everyone is happy to use this four-character string.  Identity (labels or IDs) is not the same as semantics (shared meaning).  In natural language conversation, ambiguity is a recognized problem and to some extent expected, with clarifying questions a common countermeasure.  Suppose a speaker is talking about “personalization.”  The listener may wonder: what does he mean?  The listener may supply her own definition, which may be the same as the speaker’s, or slightly different — or vastly different.  The listener may use a different name to refer to what the speaker considers “personalization” — to the listener, the speaker is talking about “customization.”  It’s also possible that the speaker and listener agree about the definition of personalization, but still conceptualize it differently: one sees it as a specific kind of algorithm while the other considers it a specific kind of online behavior.  Even shared definitions don’t imply shared mental models, which involve assumptions outside of simple definitions.  

 Concepts, whether familiar or technical, often lack precise boundaries or universal definitions.  Humans often encounter others using the same terminology to refer to a concept, only to find that others don’t embrace the same meaning or perspective about it.  

Much of what passes for semantics in the computer realm is more about agreed identifiers than agreed meaning.  A “thing” that can be precisely identified can have multiple identities — depending on who is perceiving.  Machines are unable to distinguish the role of denotation from connotation.  Consider an emoji of a facial expression, which has a specific Unicode ID.  The Unicode ID denotes a visual face of some sort.  But the meaning of that facial expression emoji is subject to interpretation.  A single Unicode character can generate polysemy — multiple interpretations.  A Globally Unique ID can spark false confidence that everyone will understand the entity in the same way.
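
A tiny illustration: the code point is fixed, but nothing about it fixes the reading.

```python
# The "slightly smiling face" emoji has a fixed Unicode identity (U+1F642),
# but a fixed code point doesn't fix its connotation: readers may take it
# as friendly, polite, or passive-aggressive depending on context.
emoji = "🙂"
print(f"U+{ord(emoji):04X}")  # U+1F642
```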

The promise and peril of IDs are that they act as stand-ins for a bunch of statements.  In natural language, we’d call them loaded phrases: they trigger all kinds of associations. Is the terminology or label used for a concept changing, or is the identity of the concept changing?   But maybe the ID isn’t precise?  We tend to describe things, and how we describe things tends to define them.  Maybe the descriptive properties aren’t relevant to what’s being said.  Maybe they aren’t useful at all.

 Shades of truth

Data are about indirect experience.  They provide detached information that’s been recorded by someone or something else.  Data are generally treated as “facts” — they’re considered objective.  Content doesn’t represent itself so absolutely.  It speaks about the author or reader’s direct experience — personally acquired or understood information.  Content is about expression, which involves individual interpretation.  Notionally, data is free of interpretation, while content supplies it.  

As mentioned earlier, data is exchanged between systems and so needs to be defined with enough precision to allow that to happen.  The value of data (in theory at least) is independent of its source.  Its utility and accuracy are presumed so where the data comes from is less a concern.  Do we care who publishes a list of US presidents or where Google gets its answers presented in its knowledge panel box?  If we believe data is intrinsically objective — that data isn’t false or misleading — we don’t.  

Content — subjective and embodying an editorial point of view — is a reflection of its publishing source.  Audiences evaluate content partly by the reputation of the source. They evaluate content not just for its factual accuracy but for its completeness, fairness, insight, transparency, and other qualities.   People value content for being unique.  Content can’t always be easily measured or compared directly against other content.  People rely on their judgment rather than some calculated comparison.

The presumption of knowledge graphs is that KNOWLEDGE (yes it’s a big shouty idea) can be reduced to data.  A less glamorous name for knowledge graphs is linked data, the term that Tim Berners-Lee championed two decades ago. The terms knowledge graphs and linked data are nearly synonymous, the main difference being that knowledge graphs involve a curated set of data (curation = editorial choices).  Because knowledge graphs are promoted as the answer to nearly every problem and possibility facing humankind, it’s important to be clear what knowledge graphs can and can’t do.  Knowledge graphs advance our ability to connect different facts and bring transparency to many domains. They are especially useful in supporting internally-focused enterprise use cases where experts share data with their colleagues, who can interpret its meaning based on a shared understanding of its significance. Knowledge graphs are indeed useful in many scenarios.  But we must accept that the “knowledge” aspect of the term is a bit of marketing hyperbole, coined by Google.  Data, by itself, no matter how elaborately connected and explicated, does not generate knowledge, because knowledge requires explanation, while data can only show self-evident things.  Content, in contrast, is about presenting information that’s not self-evident to audiences.  Content explains facts.  Knowledge graphs link facts.

Many possibilities are available to connect items of data together especially when joining tables or federating search across different sources.  But the output of data queries is still limited in what it can express.  Data are related through operators, such as equals (is), comparisons such as more than or before, and inclusion (has).  That’s a small set of expressions compared to natural language.  Semantic data is more expressive than ordinary relational data because its properties function like verbs in natural language.  But the constraint remains. For data to be manageable, the properties must be enumerated into a limited list of verbs.  Nuance is not a forte of data.  
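
The sketch below (with made-up records) shows how small that operator vocabulary is in practice: equality, comparison, and inclusion do most of the work.

```python
# A sketch of the small operator vocabulary data queries lean on:
# equality, comparison, and inclusion — nothing like the verbs of natural language.
gadgets = [
    {"name": "A", "price": 19.0, "tags": {"wireless"}},
    {"name": "B", "price": 9.5,  "tags": {"wired", "budget"}},
    {"name": "C", "price": 14.0, "tags": {"wireless", "budget"}},
]

cheap_wireless = [
    g["name"]
    for g in gadgets
    if g["price"] < 15.0            # comparison
    and "wireless" in g["tags"]     # inclusion
]
print(cheap_wireless)  # ['C']
```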

Data’s relevance challenge

How does data establish its relevance to audiences?  Unlike content,  data was never created with a specific audience need in mind.  So how can it become relevant to people?  

Content APIs are designed around basic questions: what content will be relevant to audiences?  The API delivers specific content that provides the answer.  The body of content will be relevant to different audiences for different reasons, though no single individual will necessarily be interested in all of it. The content was created to answer specific questions.  The API’s task is to determine which slice within the content body is needed for a specific context at a specific time.  Answer-responses can be programmatically delivered using a flexible range of simple declarative questions.

Data, being more open-ended in how it can be queried, has more difficulty providing relevance reliably.  Data can answer straightforward questions easily: what’s the cheapest gadget that’s in stock, or the highest-rated gadget?  While the result doesn’t provide much explanation, the answers can point audiences to content items they may want to view to understand more.

Knowledge graphs are supposed to turn general-purpose data into something that will be understandable and relevant to the casual inquirer. They’re able to do that when the user understands the domain already, but even then the relevance of results is uncertain.  Using knowledge graphs to extract valuable insights from data can be a deeply labor-intensive exercise  — a treasure hunt in an age when consumers expect systems will automatically tell them what they need to know.

Knowledge graphs are generally built from graph databases, which connect different types of entities and reveal their indirect relationships.  These databases have been difficult to translate into audience-facing applications.  People don’t know what to ask when dealing with indirect relationships: what relationships are potentially valuable to explore.  For this reason, we see few consumer applications using SPARQL, the most commonly used specialized query language for knowledge graphs.
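
For readers who haven’t seen one, here is a minimal sketch of a SPARQL query run with rdflib over an invented ex: vocabulary; it answers an indirect-relationship question, but someone still had to know that this was a question worth asking:

```python
# A sketch of an indirect-relationship query in SPARQL, using rdflib
# and an invented ex: vocabulary.
from rdflib import Graph, Namespace

EX = Namespace("https://example.org/")
g = Graph()
g.add((EX.alice, EX.worksWith, EX.bob))
g.add((EX.bob,   EX.worksWith, EX.carol))

# "Who do Alice's colleagues work with?" — a two-hop, indirect question.
results = g.query("""
    PREFIX ex: <https://example.org/>
    SELECT ?indirect WHERE {
        ex:alice ex:worksWith ?direct .
        ?direct  ex:worksWith ?indirect .
    }
""")
for row in results:
    print(row.indirect)   # https://example.org/carol
```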

Instead, most consumer recommendation engines operate using a different sort of graph database, called a property graph, based on self-defined semantics rather than universally agreed ones.  Property graphs are optimized to find the shortest path or distance between two objects.    The goal of the recommendation is to highlight the strength of the connection rather than providing a way for the customer to drill into the multifaceted relationships.  Recommendation engines are fundamentally different in purpose from faceted filtering.  They hide complexity from the user rather than exposing users to it.
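
A rough sketch of that shortest-path style of reasoning, using networkx as a stand-in for a property-graph database (the nodes and edges are invented):

```python
# A sketch of the shortest-path style of reasoning behind many
# recommendation engines, using networkx in place of a property-graph database.
import networkx as nx

g = nx.Graph()
g.add_edge("customer:ann", "product:headphones", kind="purchased")
g.add_edge("customer:bea", "product:headphones", kind="purchased")
g.add_edge("customer:bea", "product:speaker",    kind="purchased")

# The shorter the path from a customer to a product, the stronger the
# implied connection — the engine surfaces the connection, not the web
# of relationships behind it.
path = nx.shortest_path(g, "customer:ann", "product:speaker")
print(path)  # ['customer:ann', 'product:headphones', 'customer:bea', 'product:speaker']
```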

The goal of using semantic data to “reason” — to draw non-explicit conclusions — hasn’t materialized in mainstream consumer applications.  Most semantic data is used to answer explicit questions, albeit sometimes highly complex ones.  One benefit of semantic data over traditional data is that the questions asked can be more elaborate because each item of data ultimately is relatable to other data.  The cost of this benefit is high: the answer to any question may be null — because it’s not obvious what questions yield useful answers.    The general public has limited interest in chaining together elaborate queries. They tend to ask straightforward questions.  

When questions become complex and topics are opaquely enmeshed, systems can’t expect audiences to articulate the question: they need to anticipate what’s needed.  While experts are interested in how an answer was arrived at, ordinary users on the web are more interested in getting the answer.  They want to be able to rely on curated queries that ask interesting questions that have meaningful answers.  

Richness and ambiguity in content and data

Whether content or data provides richer expression depends on many factors.  Both can be vague.  Using many words won’t necessarily make things clear.  And presenting many facts doesn’t tell you everything you need to know.

One measure of the value of a resource is its succinctness. A succinct resource can convey much.  Alternatively, it may flatten out important nuances.  Data’s value is associated with its  facticity, where facts speak for themselves and no one entertains deviant ideas about what those facts mean.  Content’s value isn’t about crystalized facts or named entities.  While content is often reduced to being about words, it’s actually about wording — interpretation, understanding, and feeling.  

Data provides confident answers to narrowly defined questions.  Content provides richer but more uncertain answers to broadly-defined ones.  

The limits of decomposition

People learn a lot by focusing on the core facts — nuggets that can be translated into data — within statements in content.  But that focus also carries a risk: the attractive but mistaken idea that all content can be boiled down to data.  People often want direct answers, which we expect can be filtered from facts.  But answers also often involve interpretations, which go beyond agreed facts. Our interpretations can diverge, even if  everyone agrees with what the facts are.

Consider a common, simple question: who is a movie for?  The content of a film can be reduced to a maturity rating. In principle, these are based on clear criteria, but in practice, they can be difficult to apply unambiguously and consistently.  Classification is often based on patterns rather than criteria satisfaction.  And the maturity rating still doesn’t tell us much about who the movie is for, even if we agree the rating is accurate.  

According to the ideas of semantic data, almost anything can be defined in terms of its properties.  An entity should be explainable through its data values.  In practice, such descriptions only work for simple entities with regularized parameters — generally human-made things or abstract archetypes.  Complex things that have evolved organically over time, whether products, living beings, or ideas, can’t be easily reduced to a handful of data parameters.  They defy simple data-explication (a short sketch after the list below illustrates the contrast) because of:

  • Tacit qualities that can’t be articulated easily  
  • Complex attribute interactions, including irregular combinations and exceptions
  • The overloading of dimensions, where there are too many dimensions to track easily
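
A short sketch in Python illustrates the contrast; the classes and attributes are invented for the example.

```python
# Minimal sketch: a human-made entity with regularized parameters can be
# described exhaustively through data values...
from dataclasses import dataclass

@dataclass
class MachineScrew:
    thread: str = "M4"
    length_mm: int = 12
    head: str = "pan"
    drive: str = "Phillips"
    material: str = "stainless steel"

# ...but an organically evolved entity resists the same treatment:
# which handful of attributes, at what level of detail, defines it?
@dataclass
class Bird:
    species: str
    wing_length_mm: float  # overlaps across related species
    plumage: str           # varies by age, sex, season, and wear
    song: str              # regional dialects and individual variation
```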

The mundane act of identifying a bird species illustrates the issue.  For many related species, identification can be hard: each species has many attributes that vary, and related species share similar attributes that generate confusion.  Birders rely on evaluating the “jizz” of a bird: the overall impression it gives rather than a checklist of criteria.  Many categories we rely on are concepts that have general tendencies rather than absolute boundaries.  For this reason, criteria-based definitions aren’t satisfactory.  

Data is also not good at representing many processes and how they change the state of things.  Consider the many ways an onion can be processed, resulting in different outcomes with different property values.  The cut of the onion (diagonal, minced) and the cooking method (steamed, sautéed, stir-fried, deep fried) influence its texture, sweetness, and so on.  The input properties influence the contours of the output properties but don’t dictate them.  There are too many subtle variables to model accurately in a data model.  Content is more efficient at discussing these nuances.

Food is born from recipes: a structure of inputs yielding an output.  It’s become a stock metaphor for illustrating how content or data works.  Those who believe “content is data” love to cite the example of food recipes.  Recipes are structured: they follow a standard convention and deal with entities and quantities.  But contrary to what most people think, recipes aren’t really data.  They are content.  They are full of nuance.

A recent semantic data research paper explored how knowledge graphs could allow substitutions in recipes.  The authors, at IBM and the Rensselaer Polytechnic Institute, explored the potential of a “knowledge graph of food.”  Their goal was to create a database that would allow people to swap out a recipe ingredient for something more healthful.  The authors figured ingredients could be quantified according to their nutritional values: vitamins, calories, fat content, etc.  “We use linked semantic information about ingredients to develop a substitutability heuristic for automatically ranking plausible ingredient substitutions.”

While the goal is laudable, the project in many ways was naive.  The authors hoped that a simple substitution would be possible with no second-order effects.  But ingredients interact in complex ways, much like the details of content do.  If you want to replace cream in a recipe, there are many options with different tradeoffs.  Suppose we think about recipes as formulas of ingredients.  When one ingredient changes, it changes the formula and the outcome.  The problem is that food isn’t only about chemical properties.  It’s about experiential properties.  People react to food according to its sensory qualities: taste foremost, but also texture, aroma, and appearance.  The authors acknowledge that “the quality of an ingredient substitution can be subjective and difficult to concretely determine.”  But they don’t dwell on that because it’s a problem they can’t solve.  Any cookbook writer will tell readers that substitutions are possible, but they will change the character of the dish and may necessitate other ingredient changes and compensating measures.  
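
To make the limitation concrete, here is a deliberately naive sketch in Python.  It is not the authors’ actual heuristic; the ingredients and nutrient numbers are invented.  It ranks substitutes purely by nearness of a few nutritional values, and so is blind to everything experiential.

```python
# Deliberately naive sketch (not the paper's actual method): rank substitutes
# for an ingredient by nearness of a few nutritional values per 100g.
# The ingredients and numbers are invented for illustration.
import math

NUTRITION = {
    "heavy cream":   {"fat": 36.0, "protein": 2.8, "carbs": 2.8},
    "coconut cream": {"fat": 34.0, "protein": 2.3, "carbs": 3.3},
    "greek yogurt":  {"fat": 5.0,  "protein": 9.0, "carbs": 4.0},
    "oat milk":      {"fat": 1.5,  "protein": 1.0, "carbs": 7.0},
}

def distance(a: dict, b: dict) -> float:
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

def rank_substitutes(ingredient: str) -> list[str]:
    target = NUTRITION[ingredient]
    others = [name for name in NUTRITION if name != ingredient]
    return sorted(others, key=lambda name: distance(target, NUTRITION[name]))

print(rank_substitutes("heavy cream"))
```

Whatever ranking such a heuristic produces, it says nothing about how the swap changes taste, texture, or the behavior of the rest of the recipe.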

When resources discuss things that are important priorities to people, they will address numerous factors they care about.   The whole will be more complex than the sum of its parts.  

Content and data as equals

People rely on both content and data.  It’s a mistake to view one as superior to the other.  

From the perspective of user experience, people relate to events in the world on three levels:

  1. Phenomena that we perceive or notice: the content of our experience
  2. How we identify, classify, and characterize those phenomena: the facts or data we ascribe to what we’ve encountered
  3. How we explain and evaluate the phenomena: the content of our personal explanations

Our perceptions translate into properties of an entity.   Our characterizations translate into its values.  Our explanations are complex and open-ended, like content.

Data can’t replace content.  A knowledge graph query will tell you very little about the state of knowledge graph research.  You’ll need to read PDFs of conference papers to learn that.

And modern digital content can’t be viable without drawing on data to explain entities in concrete detail.  Data can highlight what’s available within the content and clarify its details.  Data enhances content.

Standards are valuable when many different people need to do the exact same thing. A basic philosophical disagreement concerns how similar or dissimilar people’s needs are.  On one extreme is the tendency to consider every resource as unique, with no reusability. That’s the old model of web content, where data was a second-class citizen.  On the other extreme is the belief that everyone needs the exact same resources and that these can be described with standardized structure.  People cease being unique.  Content is a second-class citizen.

Content needs to move forward to develop internal publishing standards for structure: internal schemas that allow coherent assembly of different parts into meaningful wholes.  That expands the range of combinations content can present, so it can address the needs of different individuals.  But the standardization of content structure can only go so far before the content gets homogenized and loses its interest and value to audiences.  It’s not realistic to expect content publishers to adopt external schemas for content: they will lose their editorial voice if they do so.  

For data to increase its utility, it will need to start considering its editorial dimensions, especially who the data will be for and what those people need.  As public wariness toward data collection by big institutions builds, ordinary people need to feel data can serve their needs and not just the needs of experts or powerful corporations.  Data curators are needed to look at how data can empower audiences.   

Data and content both have structures, which allow them to be managed in similar technical ways.  Both content and data can be queried with APIs, and APIs can combine both content and data from different sources.  But despite that commonality, they have different purposes.  The goal should be to make them work together where it makes sense, but not to expect one to be held hostage to the other.  

The technical possibilities for transforming content and data have never been greater.  It can be easy to become dazzled by these possibilities and idealize how beneficial they will be for people.  Too little of the discussion so far has focused on what ordinary people need. Experts need to engage more with non-experts to learn what makes digital resources truly meaningful.  

— Michael Andrews

Categories
Agility

XML, Latin, and the demise or endurance of languages

We are living in a period of great fluctuation and uncertainty.  In nearly every domain — whether politics, business, technology, or health policy — people are asking what is the foundation upon which the future will be built.  Even the very currency of language doesn’t seem solid.  We don’t know if everyone agrees what concepts mean anymore or what’s considered the source of truth.

Language provides a set of rules and terms that allow us to exchange information.  We can debate if the rules and terms are good ones — supporting expression.  But even more important is whether other groups understand how to use these rules and terms.  Ubiquity is more important than expressiveness because a rich language is not very useful if few people can understand it.

I used to live in Rome, the Eternal City.  When I walked around, I encountered Latin everywhere: it is carved on ancient ruins and Renaissance churches.  No one speaks Latin today, of course.  Latin is a dead language.  Yet there’s also no escaping its legacy.  Latin was ubiquitous and is still found scattered around in many places, even though hardly anyone understands it today.  Widely used languages such as Latin may die off over time, but they don’t suddenly disappear.  Slogans in Latin still appear on our public buildings and currency.  

I want to speculate about the future of the XML markup language and the extent to which it will be eternal.  It’s a topic that elicits diverging opinions, depending on where one sits.  XML is the foundation of several standards advocated by certain content professionals.  And XML is undergoing a transition: it’s lost popularity but is still present in many areas of content. What will be the future role of XML for everyday online content?  

In the past, discussions about XML could spark heated debates between its supporters and detractors.  A dozen years ago, for example, the web world debated the XHTML-2 proposal to make HTML compliant with XML. Because of its past divisiveness, discussions comparing XML to alternatives can still trigger defensiveness and wariness among some even now. But for most people, the role of XML today is not a major concern, apart from a small number of partisans who use XML either willingly or unwillingly.  Past debates about whether XML-based approaches are superior or inferior to alternatives are largely academic at this point. For the majority of people who work with web content, XML seems exotic: like a parallel universe that uses an unfamiliar language.   

Though only a minority of content professionals focus on XML now, everyone who deals with content structure should understand where XML is heading.  XML continues to have an impact on many things in the background of content, including ways of thinking about content that are both good and bad.  It exerts a silent influence over how we think about content, even for those who don’t actively use it.  The differences between XML and its alternatives are rarely discussed directly now, having been driven under the surface, out of view — a tacit truce to “agree to disagree” and ignore alternatives.  That’s unfortunate, because it results in silos of viewpoints about content that are mutually contradictory.  I don’t believe choices about the structural languages that define communications should be matters of personal preference, because many consequences result from these choices that affect all kinds of stakeholders in the near and long term.  Language, ultimately, is about being able to construct a common meaning between different parties — something content folks should care about deeply, whatever their starting views.   

XML today

Like Latin, XML has experienced growth and decline.  

XML started out promising to provide a universal language for the exchange of content.  It succeeded in its early days in becoming the standard for defining many kinds of content, some of which are still widely used.  A notable example is the Android platform, first released in 2008, which uses XML for screen layouts.  But XML never succeeded in conquering the world by defining all content.  Despite impressive early momentum, XML has for the past decade seemed less important with each passing year.  Android’s screen layout was arguably the last major XML-defined initiative.  

A small example of XML’s decline is RSS feeds.  RSS was one of the first XML formats for content and was instrumental in the expansion of the first wave of blogging.  However, over time, fewer and fewer blogs and websites actively promoted RSS feeds.  RSS is still used widely but has been eclipsed by other ways of distributing content.  Personally, I’m sorry to see RSS’s decline.  But I am powerless to change that.  Individuals must adapt to collectively-driven decisions surrounding language use.    

By 2010, XML could no longer credibly claim to be the future of content.  Web developers were rejecting XML on multiple fronts:

  • Interactive websites, using an approach then referred to as AJAX (the X standing for XML), stopped relying on XML and started using the more web-friendly data format known as JSON, designed to work with JavaScript, the most popular web programming language (a small comparison appears after this list). 
  • The newly-released HTML5 standard rejected XML compatibility.  
  • RESTful APIs for content exchange, which embraced JSON over XML, started to take off.  
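
A small comparison, sketched in Python with standard-library parsers and an invented record, hints at why developers found JSON friendlier: the same facts take less ceremony to reach.

```python
# The same invented record in XML and JSON, read with Python's standard library.
import json
import xml.etree.ElementTree as ET

xml_doc = "<article><title>Hello</title><author>Ana</author></article>"
json_doc = '{"title": "Hello", "author": "Ana"}'

root = ET.fromstring(xml_doc)
print(root.find("title").text, root.find("author").text)

record = json.loads(json_doc)
print(record["title"], record["author"])
```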

Around the same time, web content creators were getting more vocal about “the authoring experience” — criticizing technically cumbersome UIs and demanding more writer-friendly authoring environments.  Many web writers, who generally weren’t technical writers or developers, found XML’s approach difficult to understand and use.  They preferred simpler options such as WordPress and Markdown.  This shift was part of a wider trend where employees expect their enterprise applications to be as easy to use as their consumer apps. 

The momentum pushing XML into a steady decline had started.  It retreated from being a mainstream approach to becoming one used to support specialized tasks.  Its supporters maintained that while it may not be the only solution, it was still the superior one.  They hoped that eventually the rest of the world would recognize the unique value of what XML offered and adopt it, scrambling to play catch up.  

That faith in XML’s superiority continues among some.  At the Lavacon content strategy conference this year, I continued to hear speakers, who may have worked with XML for their entire careers, refer to XML as the basis of “intelligent content.”  Among people who work with XML, a common refrain is that XML makes content future-ready.  These characterizations imply that if you want to be smarter with content and make it machine-ready, it needs to be in XML.  The myth that XML is the foundation of the future has been around since its earliest days.  Take the now-obscure AI markup language, AIML, created in 2001, which was an attempt to encode “AI” in XML.  It ended up being one of many zombie XML standards that weren’t robust enough for modern implementations and thus weren’t widely used.  Given trends in XML usage, it seems likely that other less mainstream XML-centric standards and approaches will face a similar fate.  XML is not intrinsically superior to other approaches.  It is simply different, having both strengths and weaknesses.  Teleological explanations  — implying a grand historical purpose — tend to stress the synergies between various XML standards and tools that provide complementary building blocks supporting the future. Yet they can fail to consider the many factors that influence the adoption of specific languages.  

The AIML example highlights an important truth about formal IT languages: simply declaring them as a standard and as open-source does not mean the world is interested in using them.  XML-based languages are often promoted as standards, but their adoption is often quite limited.  De facto standards — ones that evolve through wide adoption rather than committee decisions — are often more important than “official” standards.  

What some content professionals who advocate XML seem to under-appreciate is how radically developments in web technologies have transformed the foundations of content.  XML became the language of choice for an earlier era in IT when big enterprise systems built in Java dominated.  XML became embedded in these systems and seemed to be at the center of everything.  But the era of big systems was different from today’s.  Big systems didn’t need to talk to each other often: they tried to manage everything themselves.  

The rise of the cloud (specifically, RESTful APIs) disrupted the era of big systems and precipitated their decline.  No longer were a few systems trying to manage everything.  Lots of systems were handling many activities in a decentralized manner.  Content needed to be able to talk easily to other systems.  It needed to be broken down into small nuggets that could be quickly exchanged via an API.  XML wasn’t designed to be cloud-friendly, and it has struggled to adapt to the new paradigm.  RESTful APIs depend on “easy, reliable and fast data exchanges,” something XML can’t offer. 

A few Lavacon speakers candidly acknowledged the feeling that the XML content world is getting left behind.  The broader organizations in which they are employed — the marketers, developers, and writers — aren’t buying into the vision of an XML-centric universe.  

And the facts bear out the increasing marginalization of XML.  According to a study last year by Akamai, 83% of web traffic today is APIs and only 17% is browsers.  This reflects the rise of smartphones and other new devices and channels.  Of APIs, 69% use the JSON format, with HTML a distant second. “JSON traffic currently accounts for four times as much traffic as HTML.” And what about XML?   “XML traffic from applications has almost disappeared since 2014.”  XML is becoming invisible as a language to describe content on the internet.

Even those who love working with XML must have asked themselves: What happened?  Twenty years ago, XML was heralded as the future of the web.  To point out the limitations of XML today does not imply XML is not valuable.  At the same time, it is productive to reality-check triumphalist narratives of XML, which linger long after its eclipse.  Memes can have a long shelf life, detached from current realities.  

XML has not fallen out of favor because of any marketing failure or political power play.  Broader forces are at work. One way we can understand why XML has failed, and how it may survive, is by looking at the history of Latin.

Latin’s journey from universal language to a specialized vocabulary

Latin was once one of the world’s most widely-used languages.  At its height, it was spoken by people from northern Africa and western Asia to northern Europe.

The growth and decline of Latin provides insights into how languages, including IT-flavored ones such as XML, succeed and fail.  The success of a language depends on expressiveness and ubiquity.

Latin is a natural language that evolved over time, in contrast to XML, which is a formal language intentionally created to be unambiguous.  Both express ideas, but a natural language is more adaptive to changing needs.  Latin has a long history, transforming in numerous ways over the centuries.

In Latin’s early days during the Roman Republic, it was a widely-spoken vernacular language, but it wasn’t especially expressive.  If you wanted to write or talk about scientific concepts, you still needed to use Greek.  Eventually, Latin developed the words necessary to talk about scientific concepts, and the use of Greek by Romans diminished.  

The collapse of the Roman Empire corresponded to Latin’s decline as a widely-spoken vernacular language.  Latin was never truly monolithic, but without an empire imposing its use, the language fragmented into many different variations, or else was jettisoned altogether.  

In the Middle Ages, the Church had a monopoly on learning, ensuring that Latin continued to be important, even though it was not any person’s “native” language.  Latin had become a specialized language used for clerical and liturgical purposes.  The language itself changed, becoming more “scholastic” and more narrow in expression. 

By the Renaissance, Latin morphed into being a written language that wasn’t generally spoken.  Although Latin’s overall influence on Europeans was still diminishing, it experienced a modest revival because legacy writings in Latin were being rediscovered.  It was important to understand Latin to uncover knowledge from the past — at least until that knowledge was translated into vernacular languages.  It was decidedly “unvernacular”: a rigid language of exchange.  Erasmus wrote in Latin because he wanted to reach readers in other countries, and using Latin was the best means to do that, even if the audience was small.  A letter written in Latin could be read by an educated person in Spain or Holland, even if those people would normally speak Spanish or Dutch.  Yet Galileo wrote in Italian, not Latin, because his patrons didn’t understand Latin.  Latin was an elite language, and over time the size of the elite who knew Latin shrank.

Latin ultimately died because it could not adapt to changes in the concepts that people needed to express, especially concerning new discoveries, ideas, and innovations.

Latin has transitioned from being a complete language to becoming a controlled vocabulary.  Latin terms may be understood by doctors, lawyers, or botanists, but even these groups are being urged to use plain English to communicate with the public.  Only in communications among themselves do they use Latin terms, which can be less ambiguous than colloquial ones. 

Latin left an enduring legacy we rarely think about. It gave us the alphabet we use, allowing us to write text in most European languages as well as many other non-European ones.  

XML’s future

Much as the collapse of the Roman Empire triggered the slow decline of Latin, the disruption of big IT systems by APIs has triggered the long term decline of XML.  But XML won’t disappear suddenly, and it may even change shape as it tries to find niche roles in a cloud-dominated world.  

Robert Glushko’s book, The Discipline of Organizing, states: “‘The XML World’ would be another appropriate name for the document-processing world.”  XML is tightly fused to the concept of documents — which are increasingly irrelevant artifacts on the internet.  

The internet has been gradually and steadily killing off the ill-conceived concept of “online documents.”  People increasingly encounter and absorb screens that are dynamically assembled from data.  The content we read and interact with is composed of data. Very often there’s no tangible written document that provides the foundation for what people see.  People are seeing ghosts of documents: they are phantom objects on the web. Since few online readers understand how web screens are assembled, they project ideas about what they are seeing.  They tell themselves they are seeing “pages.” Or they reify online content as PDFs.  But these concepts are increasingly irrelevant to how people actually use digital content.  Like many physical things that have become virtual, the “online document” doesn’t really resemble the paper one.  Online documents are an unrecognized case of skeuomorphism.

None of this is to say that traditional documents are dead.  XML will maintain an important role in creating documents.  What’s significant is that documents are returning to their roots: the medium of print (or equivalent offline digital formats).  XML was originally developed to solve desktop publishing problems.  Microsoft’s Word and PowerPoint formats are XML-based, and Adobe’s PDF format incorporates XML metadata.  Both firms are trying to make these “office” products escape the gravitational weight of the document and become more data-like.  But documents have never fit comfortably in an interactive, online world.  People often confuse the concepts of “digital” and “online”.  Everything online is digital, but not everything digital is online or meant to be.  A Word document is not fun to read online.  Most documents aren’t.  Think about the 20-page terms and conditions document you are asked to agree to.  

A document is a special kind of content.  It’s a highly ordered, large-sized content item.  Documents are linear, with a defined start and finish.  A book, for example, starts with a title page, provides a table of contents, and ends with an appendix and index.  Documents are offline artifacts.  They are records that are meant to be enduring and not change.  Most online content, however, is impermanent and needs to change frequently.  As content online has become increasingly dynamic, the need for maintaining consistent order has lessened as well.  Online content is accessed non-linearly.  

XML promoted a false hope that the same content could be presented equally well both online and offline — specifically, in print.  But publishers have concluded that print and online are fundamentally different.  They can’t be equal priorities.  Either one or the other will end up driving the whole process.  For example, The Wall Street Journal, which has an older subscriber base, has given enormous attention to its print edition, even as other newspapers have de-emphasized or even dropped theirs.  In a review of its operations this past summer, the Journal found that its editorial processes were dominated by print.  Decisions about content are based on the layout needs of print, such as content length, article and image placement, as well as the differences in delivering a whole edition versus delivering a single article.  Print has been hindering the Journal’s online presence because it’s not possible to deliver the same content to print and screen as equally important experiences.  As a result, the Journal is contemplating de-emphasizing print, making it follow online decisions rather than compete with them.

Some publishers have no choice but to create printable content.  XML will still enjoy a role in industrial-scale desktop publishing.  Pharmaceutical companies, for example, need to print labels and leaflets explaining their drugs.  The customer’s physical point of access to the product is critical to how it is used — potentially more important than any online information.  In these cases, the print content may be more important than the online content, driving the process for how online channels deliver the content.  Not many industries are in this situation and those that are can be at risk of becoming isolated from the mainstream of web developments.  

XML still has a role to play in the management of certain kinds of digital content.  Because XML is older and has a deeper legacy, it has been decidedly more expressive until recently.  Expressiveness relates to the ability to define concepts unambiguously.  People used to fault the JSON format for lacking a schema like XML has, though JSON now offers such a schema.  XML is still more robust in its ability to specify highly complex data structures, though in many cases alternatives exist that are compatible with JSON.   Document-centric sectors such as finance and pharmaceuticals, which have burdensome regulatory reporting requirements, remain heavy users of XML.  Big banks and other financial institutions, which are better known for their hesitancy than their agility, still use XML to exchange financial data with regulators. But the fast-growing FinTech sector is API-centric and is not XML-focused.  The key difference is the audience.  Big regulated firms are focused on the needs of a tightly knit group of stakeholders (suppliers, regulators, etc.) and prioritize the bulk exchange of data with these stakeholders.  Firms in more competitive industries, especially startups, are focused on delivering content to diverse customers, not bulk uploads.  
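
As a sketch of the point above about JSON now offering a schema, the snippet below uses the third-party jsonschema package in Python with an invented, minimal schema; real regulatory schemas are far more elaborate than this.

```python
# Minimal sketch of JSON Schema validation using the third-party
# "jsonschema" package; the schema and document are invented.
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "isin": {"type": "string"},
        "notional": {"type": "number"},
    },
    "required": ["isin", "notional"],
}

try:
    validate(instance={"isin": "US0378331005", "notional": 1_000_000}, schema=schema)
    print("valid")
except ValidationError as err:
    print("invalid:", err.message)
```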

XML and content agility

The downside of expressiveness is heaviness.  XML has been criticized as verbose and heavy — much like Victorian literature.  Just as Dickensian prose has fallen out of favor with contemporary audiences, verbose markup is becoming less popular.  Anytime people can choose between a simple way or a complex one to do the same thing, they choose the simple one.  Simple, plain, direct. They don’t want elaborate expressiveness all the time, only when they need it.  

When people talk about content as being intelligent (understandable to other machines), they may mean different things.  Does the machine need to be able to understand everything about all the content from another source, or does it only need to have a short conversation with the content?  XML is based on the idea that different machines share a common schema or basis of understanding.  It has a rigid formal grammar that must be adhered to.  APIs are less worried about each machine understanding everything about the content coming from everywhere else.  They only care about understanding (accessing and using) the content they are interested in (a query).  That allows for more informal communication.  By being less insistent on speaking an identical formal language, APIs enable content to be exchanged more easily and used more widely.  As a result, content delivered by APIs is more ubiquitous: able to move quickly to where it’s needed.  

Ultimately, XML and APIs embrace different philosophies about content.  XML provides a monolithic description of a huge block of content.  It’s concerned with strictly controlling a mass of content and involves a tightly coupled chain of dependencies, all of which must be satisfied for the process to work smoothly.  APIs, in contrast, are about connecting fragments of content.  It’s a decentralized, loosely coupled, bottom-up approach.  (The management of content delivered by APIs is handled by headless content models, but that’s another topic.)

Broadly speaking, APIs treat the parts as more important than the whole.  XML treats the whole as more important than the parts.  

Our growing reliance on the cloud has made it increasingly important to connect content quickly.  That imperative has made content more open.  And openness depends on outsiders being able to understand what the content is and use it quickly.  

As XML has declined in popularity, one of its core ideas has been challenged.  The presumption has been that the more markup in the content, the better.  XML allows for many layerings of markup, which can specify what different parts of text concern.  The belief was that this was good: it made the text “smarter” and easier for machines to parse and understand.  In practice, this vision hasn’t happened.  XML-defined text could be subject to so many parenthetical qualifications that it was like trying to parse some arcane legalese.  Only the author understood what was meant and how to interpret it.  The “smarter” the XML document tried to be, the more illegible it became to people who had to work with the document — other authors or developers who would do something later with the content.    Compared with the straightforward language of key-value pairs and declarative API requests, XML documentation became an advertisement pointing out how difficult its markup is to use.  “The limitations in JSON actually end up being one of its biggest benefits. A common line of thought among developers is that XML comes out on top because it supports modeling more objects. However, JSON’s limitations simplify the code, add predictability and increase readability.”  Too much expressiveness becomes an encumbrance.  
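
To make that complaint tangible, here is an exaggerated, invented fragment in Python: the same assertion rendered once with heavy inline markup and once as flat key-value pairs.  Neither form is taken from any real standard.

```python
# Exaggerated, invented example: the same assertion rendered with heavy
# inline markup versus a flat key-value structure.
heavily_marked_up = (
    "<para><claim certainty='high'><entity type='drug'>Aspirin</entity>"
    "<relation type='treats'><condition>headache</condition>"
    "</relation></claim></para>"
)

key_value = {"drug": "Aspirin", "relation": "treats", "condition": "headache"}

# Both encode the same statement; only one is easy to scan, query, and reuse.
print(key_value)
```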

Like any monolithic approach, XML has become burdened by details as it has sought to address all contingencies.  As XML ages, it suffers from technical debt.  The specifications have grown, but don’t necessarily offer more.  XML’s situation today is similar to Latin’s situation in the 18th century, when scientists were trying to use Latin to communicate new scientific concepts.  One commenter asserts that XML suffers from worsening usability: “XML is no longer simple. It now consists of a growing collection of complex connected and disconnected specifications. As a result, usability has suffered. This is because it takes longer to develop XML tools. These users are now rooting for something simpler.”  Simpler things are faster, and speed matters mightily in the connected cloud.  What’s relevant depends on providing small details right when they are needed.

At a high level, digital content is bifurcating between API-first approaches and those that don’t rely on APIs.  An API-first approach is the right choice when content is fast-moving.  And nearly all forms of content need to speed up and become more agile.  Content operations are struggling to keep up with diversifying channels and audience segmentation, as well as the challenges of keeping the growing volumes of online content up-to-date.  While APIs aren’t new anymore, their role in leading how content is organized and delivered is still in its early stages.  Very few online publishers are truly API-first in their orientation, though the momentum of this approach is building.

When content isn’t fast-moving, APIs are less important. XML is sometimes the better choice for slow-moving content, especially if the entire corpus is tightly constructed as a single complex entity.  Examples are legal and legislative documents or standards specifications. XML will still be important in defining the slow-moving foundations of certain core web standards or ontologies like OWL — areas that most web publishers will never need to touch.  XML is best suited for content that’s meant to be an unchanging record.  

 Within web content, XML won’t be used as a universal language defining all content, since most online content changes often.  For those of us who don’t have to use XML as our main approach, how is it relevant?  I expect XML will play niche roles on the web.  XML will need to adapt to the fast-paced world of APIs, even if reluctantly.  To be able to function more agilely, it will be used in a selective way to define fragments of content.  

An example of fragmental XML is how Google uses SSML, an XML-based markup standard for indicating speech emphasis and pronunciation.  This standard predates the emergence of consumer voice interfaces, such as “Hey Google!”  Because it was in place already, Google has incorporated it within the JSON-defined schema.org semantic metadata they use.  The XML markup, with its angled brackets, is inserted within the quote marks and curly brackets of JSON.  JSON describes the content overall, while XML provides assistance to indicate how to say words aloud. 
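
The pattern looks roughly like the sketch below, written as a Python dictionary serialized to JSON.  The field names are illustrative rather than an exact Google payload, but the shape is the point: SSML’s angle brackets ride inside JSON’s quote marks.

```python
# Illustrative sketch: SSML markup carried inside a JSON payload.
# Field names are invented for the example, not an exact vendor format.
import json

response = {
    "@context": "https://schema.org",
    "@type": "Answer",
    "text": "Tomato is pronounced either way.",
    "spokenText": (
        "<speak>You can say "
        "<phoneme alphabet='ipa' ph='təˈmeɪtoʊ'>tomato</phoneme> or "
        "<phoneme alphabet='ipa' ph='təˈmɑːtoʊ'>tomato</phoneme>."
        "<break time='300ms'/> Either is fine.</speak>"
    ),
}

print(json.dumps(response, ensure_ascii=False, indent=2))
```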

SVG, used to define vector graphics, is another example of fragmental XML.  SVG image files are embedded in or linked to HTML files without needing to have the rest of the content be in XML.

More generally, XML will exist on the web as self-contained files or as snippets of code.  We’ll see less use of XML to define the corpus of text as a whole.  The stylistic paradigm of XML, of using in-line markup — comments within a sentence — is losing its appeal, as it is hard for both humans and machines to read and parse.  An irony is that while XML has made its reputation managing text, it is not especially good at managing individual words.  Swapping words out within a sentence is not something that any traditional programming approach does elegantly, whether XML-based or not, because natural language is more complex than an IT language processor.  What’s been a unique advantage of XML — defining the function of words within a sentence — is starting to be less important.  Deep learning techniques (e.g., GPT-3) can parse wording at an even more granular level than XML markup, without the overhead.  Natural language generation can construct natural-sounding text.  Over time, the value of in-line markup for speech, such as that used in SSML, will diminish as natural language generation improves its ability to present prosody in speech.  While deep learning can manage micro-level aspects of words and sentences, it is far from being able to manage the structural and relational dimensions of content.  Different approaches to content management, whether utilizing XML or APIs connected to headless content models, will still be important.  

As happened with Latin, XML is evolving away from being a universal language.  It is becoming a controlled vocabulary used to define highly specialized content objects.  And much like Latin gave us the alphabet upon which many languages are built, XML has contributed many concepts to content management that other languages will draw upon for years to come.  XML may be becoming more of a niche, but it’s a niche with an outsized influence.

— Michael Andrews