Revisiting the difference between content and data

This post looks in detail at the differences between content and data.  

TL;DR: I understand you’re too busy to read a 10,000+ word post. No worries, I’ve pulled out some highlights in the table below. If this data doesn’t answer your questions, you may have to read further.

Content | Data
Open-domain | Closed domain
No restrictions on expression | Expression is restricted
Has an intent | Doesn’t have an intent
Has an author | Often anonymous
Complex values | Single, unambiguous values
Topics and narratives | Entities
Structure builds resources | Structure defines boundaries
Facts discussed outside of pre-defined relationships | Facts described through pre-defined relationships
Assembly structure is coupled | Assembly structure is decoupled
Editorial composition | Logical composition
Has a specific audience | Is universally relevant
Nuanced in meaning | Meaning is standardized
Focused on audience interest | Focused on resource reusability
Meaning can be independent of context | Meaning is dependent on context
Build up meaning | Break down meaning
The bigger theme | The details
Often proprietary | Often public domain
Uniqueness is valued | Regularity is valued
Cares about audience attention | Doesn’t care about audience attention

Some differences between content and data

What’s the issue and why does it matter?

New approaches to managing content and data such as headless content management and knowledge graphs are altering how we work with digital resources on a technical level.  Those developments have made it even more important to address what humans need when it comes to content and data, especially since the needs of people so far have not been at the forefront of these technological developments.  Too often, the resource is considered more valuable than the people who might use the resource.  It’s time to define future approaches to accessing content and data in a more human-centered way.   The first step is to be clear about the differences between how people use content and data.    

While the difference between content and data may seem like an idle philosophical question, it goes to the heart of how we conceptualize and imagine our use of resources online. Probing the distinction allows us to examine our sometimes unconscious assumptions about the value of different resources.  Content and data are basic building blocks for communication and understanding.  Yet I’ve been surprised how differently professionals think about their respective value, potential, and limitations.  My thinking about this topic has evolved as well, as I watch changes in the possibilities for working with both but also encounter sometimes idealized notions about what they can accomplish.  

Our mental model of digital resources influences how we work with and value them. Many people consider content and data as different things, even if they can’t delineate precisely how they differ. They often have different perspectives of the value of each. Some popular tropes illustrate this. Data is the “new oil”: the value of content is to generate or extract data. Content is “king” (or queen): data exists to support content. In both these perspectives, content or data is seen as raw material, a means to an end, though they disagree on what the end is.

When we talk about content and data, we rarely define what these terms mean. This situation has been true since the early days of the web.  We’ve made little progress in understanding what various digital resources mean to people and the picture keeps getting more complex.  Content and data are becoming more intertwined, but they aren’t necessarily converging.  

On a technical level, various mental models people have about resources get translated into formal schemas.  How we architect resources influences how people can use them. 

Are content and data different categories of resources?

Professionals who work with digital resources in different roles don’t share the same understanding of how content relates to data.  Until recently that wasn’t a big problem, because content and data lived in separate silos.  People who worked with content could comfortably ignore the details of data, and those with a data focus were indifferent to content.  

Outside of those who create and manage content, content is still largely ignored in the IT world.  There’s always been more of a focus on information or knowledge.  When you avoid discussing content and understanding its role, you can slip into a shaky discussion about delivering information or providing “knowledge” to users without considering what audiences actually need.  

But the historical silos between content and data are slowly coming down, and it’s becoming more important to understand how content and data are related. Unfortunately, it’s not so simple to express their differences succinctly, because they vary in many different dimensions. They aren’t just words with simple definitions, but complex concepts.

Everyday definitions of content and data don’t help us much in locating critical distinctions.  For example, Princeton University’s Wordnet lexical database provides the following short definitions of each:

  • Content: message, subject matter, substance (what a communication that is about something is about)
  • Data: information (a collection of facts from which conclusions may be drawn) “statistical data”

These definitions hint at differences, but also areas of overlap.  The distinction between a “message” and “information” is not obvious.  In everyday speech, we might say “Did you receive the message (alternatively: information) today?”  Similarly, the ideas of “substance” and “facts” seem similar.  It’s tempting to dismiss these problems as the byproduct of sloppy definitions, but I believe many of the difficulties stem from the complex and changing essence of these concepts.  

When concepts are difficult to define, many people look for examples to show what something means.  But canonical examples of familiar resources don’t help us much.  Let’s consider two ink-and-paper products that can be purchased from the US Government Printing Office.  The US Census is a canonical example of data: a compilation of cold, statistical facts.  The US Constitution is a canonical example of content — a series of statements that are rich with meaning — so much so that people passionately debate what every word means. Canonical examples can provide concrete illustrations of concepts, but most of the resources we deal with will not be so well defined.  

In the past, we categorized resources according to end user: data was for machines to process and calculate, while content was for humans to understand. Or we conceptualized content and data by the environments in which we encountered them. For example, we viewed data as rows and columns of text and numbers in a spreadsheet or relational database. While these could be presented to readers in a table, the rawness of the source material did not make it seem like content. As computers began to process all our resources, the picture became more complicated. Computers could store records, such as my university transcript, which provided an overview of my studies: a story of a sort. Computers also stored documents. I began using computers to create documents while in university, even though the delivered document was on paper. Was the file on that floppy disk content or data? Conversely, I used card stock paper to create data for computers, by filling in the dots on a computer punch card, manually processing data for the benefit of a computer.

Based on past experience, people are often inclined to see content and data as different.  It’s also worth considering how they might be related. Earlier this month, a developer I know posted on a content strategy forum that “content is data.” Professionals working with digital resources may view content and data as having a range of relationships:

  1. They are identical (or any avowed distinctions are inconsequential)
  2. They are separate and independent from each other, with no overlap
  3. They overlap in some aspects (either sharing common properties or representing a continuum)
  4. One is a subset of the other (content is a kind of data or data is a kind of content)

I’ve encountered people who promote each of these views, among others. It’s even possible to see data or content as an expendable resource with no inherent value. For researchers in natural language processing, content is merely a “data set”: a very long string of characters to bend into shape to support different use cases.

My own view is that content and data are fundamentally different, with only limited overlap. They coexist and complement each other, but they are distinct kinds of resources. In one specific dimension, they are becoming more alike: how they are stored and can be managed.  But they are very different in two other areas: how they are generated and created, and how they are consumed and used.

When viewed solely through the lens of technology, content and data can seem to resemble each other.  Both can be structured by models that are similar in form. Much of what makes content and data different is invisible to technology: they have different purposes, and people relate to them in different ways.  A growing source of confusion arises when individuals assume that content and data should behave the same way because they have similarities in form.  But morphological similarities are not the full story, just as the wings of pigeons, penguins, and ostriches look similar but play different roles.  

How content and data are stored and managed

Digital resources have a material presence. They take up space.  I live only a few miles from where one of the largest concentrations of server farms in the US is hosted and am keenly aware of the land and energy they need.  What’s lurking there?  What are these resources talking about?

Many developers see no intrinsic difference between data and content: both are simply digital objects with IDs.  Different models of storage — branching code repositories, file structures, XML encodings, graph databases, schema-less databases — are simply alternative ways to organize bytes of data.   Some developers refer to content as unstructured data.  According to that perspective, content is a form of data, but in a less perfect form.  The best that content can aspire to is to become “semistructured” data.

Developers encounter the terms content and data in jargon referring to the format of the resource they are dealing with.  They deal with “content types” (for HTTP requests) and “data types” (for data values stored or retrieved).  For example, text can be plain or HTML.  In this sense, content and data aren’t too different: they define a discrete payload.  Small wonder developers don’t spend much time pondering the distinctions.

Within a content management system, the term “content types” appears again but in a different sense: they define the fields for an item of content, and each field needs a data type to indicate the kind of value used.  Here, content types are made from data types and might be considered a superset of data types.
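The idea that a content type is assembled from typed fields can be sketched in a few lines of Python. This is only an illustration: the `Field` and `ContentType` classes and the `article` example are hypothetical, not drawn from any particular CMS.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Field:
    """One named field in a content type, with the data type of its value."""
    name: str
    data_type: type


@dataclass(frozen=True)
class ContentType:
    """A CMS-style content type: a named collection of typed fields."""
    name: str
    fields: Tuple[Field, ...]

    def accepts(self, item: dict) -> bool:
        """Check that an item supplies every field with a value of the declared type."""
        return all(
            f.name in item and isinstance(item[f.name], f.data_type)
            for f in self.fields
        )


# Hypothetical content type, for illustration only.
article = ContentType(
    name="article",
    fields=(
        Field("title", str),
        Field("body", str),
        Field("word_count", int),
    ),
)

draft = {"title": "Content and data", "body": "Free text goes here.", "word_count": 4}
print(article.accepts(draft))
```

Note what the check does and does not cover: it confirms that `body` holds text, but says nothing about what that text discusses. The data types constrain the form of each value, while the substance of the content stays open.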

In the CMS-specific definition of content types, we see signs of semantics, which break the resource into parts that have names. The semantics help describe what the resource is about, not just how to process it as a format. Provided a developer knows the structure of the resource, they can query it to obtain specific elements within the resource to answer questions.

What do the parts of a resource offer and how can they be used? These questions are often answered in API documentation, which explains the model of the resource. Most serious CMSs now have a content API; better ones expose their entire content model. This is having radical consequences. Content is not tied to any specific display destination such as a website, but instead becomes a dynamic resource. With GraphQL, a fast-growing API query standard, it’s become easy to combine content from different sources. Content publishing has the potential to become multisource: distributed and federated. Content can be exchanged between different sources, which don’t need to have precisely the same understanding of what the originating source had in mind. APIs are like a phrasebook that translates between different parties. People don’t need to speak the identical language (schema) provided that they have the phrasebook. This is a major shift from the past when the presumption had been that everyone needed to agree to a common schema to share their resources.
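The phrasebook analogy can be made concrete with a small sketch. This is plain Python rather than GraphQL, and both sets of field names are invented for illustration; it only shows how a mapping lets two parties exchange an item without sharing a schema.

```python
# Hypothetical "phrasebook": maps one source's field names to the names
# a consuming publisher uses. Neither schema is real.
PHRASEBOOK = {
    "headline": "title",
    "standfirst": "summary",
    "byline": "author_name",
}


def translate(item: dict, phrasebook: dict) -> dict:
    """Rename the fields the phrasebook knows about; keep the rest unchanged."""
    return {phrasebook.get(key, key): value for key, value in item.items()}


source_item = {
    "headline": "Revisiting content and data",
    "standfirst": "How the two differ.",
    "byline": "A. Writer",
    "language": "en",
}

translated = translate(source_item, PHRASEBOOK)
print(translated)
```

The mapping translates names, not intent. Whether a "standfirst" really plays the same editorial role as a "summary" is a judgment the phrasebook cannot make.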

Non-technical users gain access to the model of a resource by using a UI that’s connected to it.  With the shift to the cloud, consumers are increasingly unconcerned about where resources are stored.  Many resources are accessed through apps.  Some browsers now don’t display full URLs.  The address of the content — its path and location — is disappearing and along with it a concern about structure.  While the cloud is just a metaphor, it does capture the reality that resources no longer have a fixed address.  Storage containerization means resources move around.  From the consumer’s perspective, the structure is becoming invisible — seamless.  They don’t see or care about how the sausage is made.  They only care what it tastes like.

In terms of access and storage, content and data are becoming more alike.  With APIs, content is becoming more malleable, like data.  And data is getting upgraded with more semantics, becoming more descriptive and content-like. These changes are largely invisible to the consumer, until they don’t work out.  People do notice when things are askew, though they are unsure why they are.  They care not only whether the files can be accessed, but also how coherent the experience is of getting that stream of resources.

Generating and creating resources: expression

What can a digital resource talk about?  Content and data are often discussing different concerns.  Content talks about topics or stories.  Data talks about entities.  While similar in focus, content and data have different expressive potentials.

Data and content have different perspectives about: 

  • What they can mention: the properties that are discussed
  • What can be said about those things: the values presented

Restricting expression

The structuring of resources will influence what resources can discuss. Structures can also impose rules on values (controlling values, validating values, or restricting values), which will further limit what can be said. Structure can limit expression.

Content is “open domain”: it can talk about anything. The author decides what topic or story to talk about; they are not restricted to a pre-determined universe of topics. Once that story or topic is chosen, they can address any aspect they want to, and there are no restrictions about the attributes of the topic. And they can say anything they want to about it; there are no restrictions about the values of those attributes.

Content, in its untamed form, is open-ended in how it discusses things. Content doesn’t require a pre-defined structure, though it’s possible to structure content to define properties that shape what dimensions get discussed.  These may be specific fields or broader ones.  Even when content is defined into structural elements, the range of values associated with these elements is open-ended (such as free text, video, or photos) and the values can discuss anything without restriction.  Restrictions on content are few: there may be a character limit on a text field or a file size limit on an image.  A few fields will only accept controlled values.  But in general, most content is composed of values created by an author.

 Data is “closed domain”: to be useful, data must be defined with a formal schema of some kind — a set of rules.  This limits what the data can talk about to what the data schema has defined already.  
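A minimal sketch of what “closed domain” means in practice, assuming a made-up schema with two properties and a fixed set of allowed values for each. Anything the schema has not defined in advance is rejected.

```python
# Hypothetical schema: the only properties this data may mention,
# and the only values each property may take.
SCHEMA = {
    "status": {"draft", "published", "archived"},
    "language": {"en", "fr", "de"},
}


def violations(record: dict, schema: dict) -> list:
    """Return a list of reasons the record falls outside the schema."""
    problems = []
    for key, value in record.items():
        if key not in schema:
            problems.append(f"unknown property: {key}")
        elif value not in schema[key]:
            problems.append(f"value not allowed for {key}: {value}")
    return problems


conforming = {"status": "published", "language": "en"}
open_ended = {"status": "nearly done", "mood": "hopeful"}

print(violations(conforming, SCHEMA))
print(violations(open_ended, SCHEMA))
```

The second record is perfectly meaningful to a human reader, yet it fails on both counts: it uses a value the schema does not list and mentions a property the schema does not know.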

Digital resources thus vary in their structure and allowed values.  Structure is not inherently good or bad — it involves a range of tradeoffs.  What’s best depends on the intent of the contributor.  We need to consider the goal of the resource.  

With data, there’s no obvious intent.  We don’t know why the data is there, or how it is supposed to be used.  But the presumption is that others will use the data, and to make the data useful, it needs to conform to certain standards relating to its structure and allowed values.

With content, the situation is different.  All content is created with an intent in mind. Sometimes that intent is vague or poorly thought out.  But content generally takes effort to create, which means there must be a motivation behind why it exists.  And by looking at the content, we can often infer its intent.  We know there’s an audience who is expected to view the content and we can make some guesses about what that audience is expecting.  The audience’s needs define both the structure and the values used. 

While data doesn’t have an audience, content does. Data doesn’t have an author, while content does.  These distinctions have implications for how resources can be used.  

Representation and ‘aboutness’ in digital resources

What’s the resource about and what’s it trying to explain?  Again, content and data focus on different aspects.

Content and data differ in what each can represent.  Content can discuss a set of facts that don’t have a predefined relationship.  Authors make decisions about what to include and how to talk about them. Content doesn’t need to be routine in what it expresses. Data is meant to convey a predefined range of facts in a precise way.  Data depends on being routine.    

Both content and data make statements, but the values for these statements are dissimilar.  Content elements hold complex values (the value may contain several ideas.)  Data elements hold simple values (normally one idea or ID per value.)

Content deals with topics and stories that ultimately are about themes, ideas, concepts, life events, and other kinds of things that are open to interpretation.  Data describes entities — concrete things in the real world or human-defined records such as invoices.  Data can only describe properties of things that can be measured according to recognized values.  Its role is to provide a consistent understanding of an entity.

A fundamental difference, then, is the approach that each uses to describe things.  Content describes these with natural language, pictures, or other forms of human communication. Data describes them in terms of their measurable properties, or their relationships to other entities.  

Data values are simple and ideally unambiguous: names, IDs, pre-decided labels, quantities, or dates.  Content is different from those simple values.  A content value can’t be easily evaluated by machines because it contains complex, multipart statements about multiple entities.  Lots of work goes into making natural language understandable to computers — finding the themes and sentiments, or recognizing when entities are mentioned.  Despite impressive progress, machines have trouble interpreting human communication.  That machines don’t find human communication reliable does not mean it’s less accurate.  Content can be both more nuanced and more compact than data.  Content may seem less precise, but it also can represent concepts and statements that data representations can’t hope to.  Data engineers are having difficulty representing even the basic features of laws and regulations, for example.

Data breaks down facts into individual statements.  By doing so, data can manage to be both specific and incomplete.  The building blocks of data allow us to say a lot about entities.  We can map the relationships between various entities, including people.  Data can tell about a couple who marry and later divorce.  But it can’t tell us why they married or why they divorced.  The enumerated values within data models can’t address complex explanations.
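The marriage example can be sketched as a handful of discrete statements. The people, dates, and the `Statement` structure below are hypothetical; the point is only that the model answers structured questions and has no place to hold an explanation.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    """One discrete fact: a subject, a relationship, an object, and a year."""
    subject: str
    predicate: str
    obj: str
    year: int


# Hypothetical people and dates, for illustration only.
STATEMENTS = [
    Statement("Alex", "married", "Sam", 2005),
    Statement("Alex", "divorced", "Sam", 2014),
]


def find_year(subject: str, predicate: str, obj: str) -> Optional[int]:
    """Answer a structured question: in which year was this statement true?"""
    for s in STATEMENTS:
        if (s.subject, s.predicate, s.obj) == (subject, predicate, obj):
            return s.year
    return None


print(find_year("Alex", "married", "Sam"))
print(find_year("Alex", "divorced", "Sam"))
# No statement carries the "why": the model has nowhere to put it.
print(find_year("Alex", "married_because", "Sam"))
```

One could add a free-text `reason` field, but at that point the value stops being data in the sense used here and becomes a small piece of content.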

Data’s fundamental purpose is to make discrete, unambiguous statements about an entity.  Data will hypostatize or reify an object. The object becomes an entify-able value: its existence is stripped down into its observable and quantifiable qualities. Once an entity is converted into values, it can be evaluated.  Even actions can be treated as entities, provided they can be enumerated into categories that are fixed in meaning.   This restriction limits the data-ification of the descriptions of processes since actions can involve so much variation.  

The ability to make discrete, unambiguous statements depends on having an agreed schema to discuss an object’s properties.  Data schemas can either be opinionated or not opinionated (an opinion being a point of view that’s not universally accepted.)   The semantics (what things mean) may involve coerced agreement — much like the terms of service you must click on to use an internet service.  How one asks questions (the syntax) can involve forced agreement as well.  Schemas can contain a range of opinions:

  • Opinionated: Everyone needs to agree to the same schema to talk about what things mean (you must accept my version of the truth.)  
  • Non-opinionated: people can define their own schemas, though they will need to learn about what others have decided if they want to use someone else’s.
  • Opinionated: everyone needs to make requests in the same way (syntax) about a set of facts defined by a schema.  
  • Non-opinionated: how one asks questions can vary and the answers (truth) are language-neutral, independent of how questions are framed. 

A deep irony — and flaw — of many efforts to structure data with a standardized schema is that these initiatives tend to dictate the use of a specific query language to access and utilize the data.  “Openness” is promoted in the name of interoperability but is done by forcing everyone to adopt a particular standard for data or queries — forcing an opinion about what’s correct and acceptable, resulting in pseudo-openness.

A schema is simply a framework that provides a context to what is being discussed. People use mental schemas to interpret the world, while machines use data schemas to do that.

Content can have meaning independently of its immediate context, provided the content is unambiguous. A fragment of content can often stand alone, while a fragment of data can’t. Content doesn’t need a formal schema: it relies on shared knowledge and shared meaning. The role of structure within content is to amplify meaning beyond the meaning carried by the words or images. The structure of content forms a scaffolding of meaning. This represents a big difference in how content and data approach structure. With content, the emphasis is on using structure to build up meaning: to enlarge ideas. With data, the emphasis is on using structure to break down meaning: to locate specific details.

Data has meaning only within a specific context. When viewed by humans, data makes sense only as part of a record that shows the context for the values presented.   For machines, data makes sense within the hierarchical or lateral relationships defined by a schema.  The innovation of semantic data is its ability to describe the meaning of data independently of a larger record.  Semantic data creates the possibility to recontextualize data.  

Data can supply precise answers to structured questions. But data does poorly when trying to convey the meaning of concepts — the big ideas that motivate and direct our behavior.  Complex concepts are difficult to express as data.  They can be given identifiers to disambiguate them from similar words that refer to different concepts.  But even a universal ID, such as a Wikidata ID, does not make the concept of “love” clear as to its meaning.  It merely tells us we aren’t talking about a band with such a name.

Data can supply a comprehensive description only when an entity can be defined entirely by data. For example, statistical categories can be defined by data.   More often, data must rely on concepts that can only be defined using content.  Even if different resources use the same term to describe a property (such as “name”), they may define those terms differently.   

Combining details to build explanations

Another aspect of expression concerns what can be presented together to create a larger whole.  Here, we aren’t looking at what can be said by a single contributor at a single point in time, but what can be combined from different sources created at different times.

Both content and data are becoming more connected: able to combine with other similar resources.   The elements within both kinds of resources are being defined more specifically with semantics indicating their meaning or purpose.  This allows larger digital resources to be composed semi-automatically, a sort of ghost authorship.

When resources are broken into elements, they can be combined into various combinations to create new resources.  On what basis are combinations made?  It’s useful to distinguish logical composition from editorial composition.  Data is concerned with logical composition, while content is concerned with editorial composition.  

Web content historically hasn’t been structured into pieces that could be separated. All content intended for publication would be created at the same time by the same author within a single body field.  The elements within the content were tightly coupled, part of a common template.   Authors enjoyed great freedom about what they could address within an item of content but had less freedom in how their content could be combined with other content they or others created.  The presentation of a web page was fixed.

Content can be composed by stitching together statements, a process done through human curation (via links) or programmatically (by filtering and gathering similar items.)  In both cases, humans decide the rules for what kinds of things belong together to compose meaningful experiences.  The difference is that hard-coded programmatic rules are applied routinely rather than once.  

Data has always been about decoupling elements to allow them to be presented in different ways.  What the data can say might be restricted, but how it can be presented can vary in many ways.  Intrinsically, data has the flexibility to combine with other data.  It’s able to either be assembled on a one-time basis from a situational query or be generated routinely from a saved query that fires routinely.
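The difference between a situational query and a saved query can be sketched with Python's built-in `sqlite3` module. The `sales` table and its rows are invented for illustration, and a database view stands in for the saved query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "tea", 10), ("north", "coffee", 4), ("south", "tea", 7)],
)

# Situational query: assembled once to answer a question asked right now.
one_off = conn.execute(
    "SELECT SUM(units) FROM sales WHERE product = 'tea'"
).fetchone()[0]

# Saved query: a view that can be re-run routinely as the data changes.
conn.execute(
    "CREATE VIEW units_by_region AS "
    "SELECT region, SUM(units) AS units FROM sales GROUP BY region"
)
by_region = conn.execute(
    "SELECT region, units FROM units_by_region ORDER BY region"
).fetchall()

print(one_off)
print(by_region)
```

The same three rows support both presentations because the values are decoupled from any one arrangement. Nothing in the rows themselves favors a per-product total over a per-region one.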

How useful are larger resources that are assembled from smaller elements?  Not all assembled resources are equally useful.

 It is easier to break apart content into data than it is to compose content from data.  Designing a successful binding agent that turns data into content is difficult. Automated journalism can generate prose from data, but the resulting content comes across as pro forma and far from engaging.

A key difference in the structure of content and data relates to intent.  Data structure specifies the meaning of a value.  Content structure will often indicate the intent of an element as well as the meaning it conveys. For example, an author has a name, which is a data value.  The author may have a bio, which is a content value.  The bio is intended to provide context about the writer, and perhaps spark interest in what they have to say.  The bio presents some facts about the author, but its purpose is broader.

Data elements don’t have predefined purposes.  This allows them to be displayed in any number of combinations.  Content elements are less independent in how they can be displayed.  When content presents a topic, patterns of elements tend to be grouped together, and hierarchies of elements address broader and more specific aspects.  

To build an explanation, the elements of a resource need to work together to provide a unified understanding of a larger topic.  Content management has moved from a tightly-coupled structure to a loosely-coupled one, especially with the development of headless content models.  Elements can now be remixed into new presentations.  But audiences must perceive these elements as belonging together.  Loose-coupling is done by relating content types: collections of editorially-compatible elements.   

The simplest form of relations defines whether something is “part of” another or is “kind of” another.  Both these relationships enable aggregation of elements discussing smaller entities into discussions of larger ones. 
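As a rough sketch, the two relations can be held as simple lookups and followed upward to aggregate smaller elements into larger ones. The element names below are hypothetical.

```python
# Hypothetical relations between content elements, for illustration only.
PART_OF = {
    "ingredients": "recipe",
    "method": "recipe",
    "recipe": "cookbook chapter",
}
KIND_OF = {
    "recipe": "how-to",
    "how-to": "instructional content",
}


def ancestors(element: str, relation: dict) -> list:
    """Follow a single relation upward from an element to its broadest ancestor."""
    chain = []
    seen = {element}
    while element in relation:
        element = relation[element]
        if element in seen:  # guard against accidental cycles
            break
        seen.add(element)
        chain.append(element)
    return chain


def parts(whole: str, relation: dict) -> list:
    """List the elements that are directly part of a larger whole."""
    return sorted(child for child, parent in relation.items() if parent == whole)


print(ancestors("ingredients", PART_OF))
print(ancestors("recipe", KIND_OF))
print(parts("recipe", PART_OF))
```

The lookups say which elements may be gathered together, not whether the result reads well. That editorial judgment still has to be designed into the content types.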

Both content and data are becoming explicit in indicating what individual elements mean. Despite that similarity, they remain different.  People confuse the concepts of structured content and structured data, yet these concepts are different in their orientation.  Both can be building blocks to generate more sophisticated resources.  But the materiality of those blocks is different.  

The structure of content reflects editorial intent.  That intent is often specific to the publisher, which is one reason why standardizations of content structure across publishers have not materialized. The area of technical documentation is a notable exception: it has tried to standardize the expression of content, with mixed results.  Some transactional content is indeed amenable to standardization: cases where people don’t want to think at all because they don’t have an opinion about the material, they trust the advice, and they don’t worry about the consequences of the advice.  API documentation is an example of documentation where the structure used by different publishers is converging, because of the high degree of similarity in its functional purpose.   

But the trend now seems to move away from making content seem like an interchangeable commodity. Content needs to sound human. The rising prominence of intention-focused approaches such as content design and UX writing reflects a growing public wariness toward cookie-cutter content for even “dry” technical and instructional topics. Audiences don’t trust content that seems formulaic, because it sounds repetitive, even robotic. When all content follows the same limited patterns, it looks the same and people have trouble noticing what is different about each piece. They question predefined answers that seem too tidy and lack background explanation. Content needs to sound conversational if it hopes to garner attention; that’s true even for technical, factually-oriented content. People hesitate if they feel they’re being blinded by details (snowed). Every detail should seem necessary to the larger purpose of what they are seeing.

Editorial intent is about providing coherence to audiences.  The structure of content needs to support coherence.  It’s the opposite of the data-centric approach of the “mash-up”:  a jarring experience involving a mishmash of items that were never meant to be presented together.  

To assemble content elements successfully into a whole, the pieces need to be designed so they fit together. The pieces should support one another to provide a richer explanation than they would if they were viewed individually.  Generic standards for content elements have never seemed coherent: they have prioritized splitting things apart rather than gluing things together.  Most of them focus on factually heavy content that few people would want to read in total.  It’s possible to combine content from different sources, but the editorial intent of each element needs to fit together. The structure should be seamlessly supporting the larger message.  It shouldn’t be fragmenting the topic to where the relationship among elements is not reinforced.

Public expectations for data are different. For data to be considered reliable, people expect it to look “regular.” Consumers don’t want to see an asterisk appended to data, a footnote explaining some nuance. Data is less trustworthy when posing as solid facts but presented tentatively or teasingly. It also becomes less trustworthy when it is presented in inconsistent ways. We expect data to be predictable and wonder what’s being hidden if it is presented in an unexpected manner. Predictable routines are more effective for data. The regularity in data makes it seem more trustworthy and reliable. People expect data to offer accuracy, not broader meaning.

While content and data are different, there’s still a big push to treat content as if it were data.  We’ve seen how earlier attempts to do this, such as mash-ups, were widely unsuccessful because they were incoherent and fragmented the experience for audiences.  More recently, data enthusiasts have promoted the application of semantics to eliminate differences in how publishers describe and use their content.  It’s important to address the potential and limitations of semantics. 

Digital resources rely on semantics. But in the view of some, these resources don’t rely on semantics enough.  Some argue that all digital resources can and should be described with a common model: the RDF data model.   

The RDF data model represents the vision of what the inventor of the World Wide Web, Sir Tim Berners-Lee, thought the web should become: a connected body of commonly described data, or linked data. I’ve seen efforts to use the RDF data model to publish content rather than data.   But I’ve never seen an RDF data model that reflects editorial priorities, structuring content the way audiences expect.  RDF works well for data, but when used for content, it delivers dull, fact-based pages that at best resemble an encyclopedia. An interesting art museum collection gets transformed into a lifeless database dangling with confusing categories and options. These efforts reflect a belief that details are more important than narratives.   

Wikipedia, the most popular encyclopedia on the web, starts with content, rather than data. Wikipedia offers templates to provide an editorial framework for the content, which enables structural uniformity but doesn’t impose it.  Wikipedia is one of the most advanced experiments attempting to harmonize content with a data model, but it highlights the limitations that even a single publisher can have doing that.  It’s possible to extract data from Wikipedia but the complexity of its content has defied efforts to normalize it into a predictable data model.  

How resources are consumed and used

The flip side of expression is interpretation.  The meaning of a resource is not only a function of how it’s represented — the semantics.  Its meaning depends on how it’s received and evaluated — what humans and machines notice and understand.  

 The possibilities for machine-assisted interpretation are growing as data becomes richer, especially when it is highly curated.  People consume data — the fields of data visualization and data storytelling, for example, have exploded in the past decade, and these human-centered experiences can be machine-generated.  And we can no longer assume that stuff traditionally considered content — narrative text, for example — is never consumed by machines.  The range of what machines can do keeps growing, but they remain fundamentally different from humans in what they aim to accomplish.  

Who’s the resource for?

Let’s look at who needs or will want to use a resource. Does everyone need it, or only some people?  Many debates about the importance of standards used to describe resources arise because of different presumptions about who needs them.  Often the audience for the resource is never clearly defined.

Those who advocate for standards believe that everyone needs a shared basis of understanding: a common schema. A related belief is that everyone needs to know the same kinds of things from the resource, so using one common standard will be adequate for everyone.  Those who don’t embrace common standards consider them as extra work, often getting in the way of what’s needed to provide a complete explanation.    

If everyone agrees about the semantics — what to describe and what that signifies —  it can amplify the understanding and the potential utility of resources.  But reaching a universal agreement is difficult to achieve because people in practice want different things from resources.  They have different goals, preferences, priorities, and beliefs about the utility of various resources. What gets represented in standards is a reflection of what the standards makers want to be consumed.  Some people’s priorities don’t fit within what the standards committee is interested in supporting.  And when resources rely on semantic conventions outside of the agreed mainstream, it can potentially limit how those resources are acknowledged by IT systems that rely on these mainstream standards.  Standards are a closed system: they encourage connections within the universe of standards-adopters while indifferently ignoring outsiders.  Common standards can promote shared understanding. But they can also throttle expression by limiting the scope of what can be said.  

Data needs a schema to indicate how different bits of information are related to one another — otherwise, the data is not intelligible. Many data schemas are self-defined to reflect the requirements of the publisher.  Some data only moves around within a single publisher.  But because data is commonly exchanged between different parties, data will often utilize a common schema of some sort.  Externally-defined standards support exchanging resources with other parties. But it’s a mistake to see them as frictionless and cost-free.  

Digital resources that adhere to “open standards” present themselves as universally usable and available.  But in practice, they can become a walled garden that’s not accessible to all, due to stealth technical hurdles or barriers.  To understand this apparent paradox, one first needs to step back and ask how inclusive the standard’s development and adoption are. Standards are created by committees of often like-minded people who have specific agendas about what they want the standard to do.  If a narrow group was responsible for developing a standard, it may look like a consensus but it may never have achieved broad interest and wide adoption.  This can happen when standards are opinionated — they seem to work great, as long as one’s willing to sign on to the concepts, presumptions, and limitations of the standard.  And every standard has these constraints, though few are keen to advertise them.  Practices that are looser in their demands tend to be more flexible (and practical) and hence more widely adopted, and in the process become de-facto standards that are more impactful than many official ones.  

Data relies on APIs to broker what’s needed by different parties.  For people using apps that must access data from various sources, machines need to exchange data and APIs indicate how to do that.  Frequently, only select data needs to be available to a limited number of other parties. In other words, not everyone needs everything. A lot of data is personal to an individual: your streaming music playlist, your car service history, or your vacation plans.  While some people want to broadcast their personal data, many people are reluctant to do that, concerned about their personal privacy and cybersecurity.  

Most data is exchanged using custom APIs, where the provider of data describes its schema. Most data does not conform to a universal standard such as the RDF data model, and only a minority of IT systems support RDF.  

Yet some data is considered public and is made universally available.  Data described with RDF can connect easily to other data that are  described with that model.  RDF is most commonly used for non-proprietary data since there’s an incentive to connect such data widely.  
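
To make the point about connecting concrete, here is a minimal sketch in plain Python rather than a real RDF library. It assumes statements are stored as (subject, predicate, object) triples; every URI, dataset name, and value is hypothetical. Because both sources use the same identifier for the same entity, combining them needs no mapping step.

```python
# Each statement is a (subject, predicate, object) triple.
# All URIs and values below are hypothetical placeholders.
Triple = tuple[str, str, str]

ARTIST = "http://example.org/artist/42"

museum_data: set[Triple] = {
    (ARTIST, "http://example.org/vocab/name", "Jane Doe"),
    ("http://example.org/artwork/7", "http://example.org/vocab/creator", ARTIST),
}

library_data: set[Triple] = {
    (ARTIST, "http://example.org/vocab/birthYear", "1901"),
}


def describe(graph: set[Triple], subject: str) -> dict[str, str]:
    """Collect every property recorded about one subject."""
    return {p: o for s, p, o in graph if s == subject}


# Merging is a plain set union because both sources share the identifier.
merged = museum_data | library_data
print(describe(merged, ARTIST))
```

The merge is trivial only because both publishers already agreed on the identifier and the vocabulary. That agreement is where the real cost sits.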

The audience for RDF-defined data is unique in several respects.  RDF schemas tend to be designed by technical people for other technical people: engineers, scientists, or data analysts. The people creating the schema are largely the same people using the data described by the schema. They share a mental model of what’s important. The data that’s targeted is presumed to be a public good: it should be available to everyone, who can access it by using a commonly adopted schema.  Semantic standards assume that data must have a universal identifier because anyone might need to use the data in its raw form.  This sentiment is most expressed in the notion of “open data”: that data must be downloadable and reusable by anyone — you agree to surrender rights to how what you create is used.  The “authorship” of data, unlike content, is often anonymous.  Data is easily replicated, becoming widely available.  It can quickly become public domain, where it’s not unique enough to merit copyright.

Unlike data, content is neither private nor public.  It’s meant to be widely used but normally it’s neither individually focused nor universally needed.  Not everyone will want or potentially need the same content.  They may want the content from specific sources since the source of the content is part of the context audiences consider when evaluating the content. Content is generally proprietary: it isn’t meant to mix freely with content from other sources.  Content normally has an identifiable individual or institutional author, unlike the case with much data.  To have value, content should be unique, and by extension be subject to copyright.   

For a scientist, any data is potentially useful because you never know ahead of time what facts might be connected to others.  For a teenager viewing a smartphone screen, only some content will ever be of interest.  Content, more than data, is situationally relevant; the requirement of having a universal identifier is not as great.  Content needs an identifier of some sort, but since not everyone in the world needs to be able to access all content, every piece of content doesn’t need a globally unique ID.  Most people only need a way to access the assembled content, such as a web page.  They don’t need access to the individual elements that comprise that web page. That task is delegated to an API, which worries about interpreting what’s wanted with what’s available.

Government-funded organizations often have an obligation to make everything they publish be described using an open standard. Public funding of data tends to drive data openness.  But the data itself may only be of interest to a small group of people.  Just because data utilizes common standards doesn’t imply there’s a universal demand for the data.  Data about the DNA of fruit flies or archeological artifacts from Crete may be of public interest, but they aren’t necessarily of universal interest.

What’s available to use? 

How resources can be used to some extent depends on the relationship between the parts and the whole.  

  • Do the parts have meaning when separated from the whole?  
  • Do different parts when combined resemble a coherent whole?

We can think about digital resources as fulfilling two different scenarios.  First, we have cases where the user is expecting something specific from the resource: the known-knowns.  They have a question to get answered, or a predefined experience they seek, such as listening to a specific music track.  Second, we have cases where the user doesn’t have strong expectations: either known-unknowns or even unknown-knowns.  

The issue is more complex than it first appears because we haven’t fully defined who the user is.  Does the user directly make their choices on their own, or do they rely on an intermediary (a machine or editor) to guide their choice?

What’s the resource for?

Both data and content can answer questions, but the kinds of questions they can answer are different.  Questions vary:

  • What does the user want to know?  
  • Do they have a specific question in mind?  
  • Does the answer provide a complete picture of what they need to know, or does it hide some important qualification or comparative? 

This distinction between data and content has practical consequences for the design of APIs that offer answers. What can be queried?  Does the value need a label to explain what it is?  Is the author of an API query a curator of content, or the seeker of knowledge?  A curator intermediator will be inclined to read the API documentation, while the knowledge seeker wants direct answers.

Both people and machines consume resources — they try to make sense of them and act on what they say.  When resources are consumed, the identity of what’s being discussed in the resource can be a vexing issue.  What is it talking about, and is it the right thing of interest?  Data IDs can help us be more precise in specifying and locating items, but they can also be misleading.  Both machines and humans presume that identical strings of characters indicate equivalent items.  In the case of humans, this is most true when the string is uncommon and assumed to refer to something unique.  Machines are normally less discriminating about how common something is: they are greedy in making inferences unless programmed not to.  Machines assume every matching set of strings refers to the same thing.  

Of course, just because different items have the same string label or ID, that doesn’t mean they are referring to the same idea.  The hashtags in folksonomies illustrate this problem.  Catchy words or phrases can become popular but have diverging meanings.  
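
A small sketch of that greedy inference, using invented posts and a hypothetical `group_by_label` helper. The grouping treats identical strings as one thing, which is exactly what hides the diverging meanings.

```python
from collections import defaultdict

# Hypothetical posts: (hashtag, what the author was actually talking about).
posts = [
    ("#jaguar", "the big cat"),
    ("#jaguar", "the car brand"),
    ("#jaguar", "the big cat"),
]


def group_by_label(items):
    """Group items the way a naive matcher would: same string, same thing."""
    groups = defaultdict(list)
    for label, meaning in items:
        groups[label].append(meaning)
    return dict(groups)


groups = group_by_label(posts)
distinct_meanings = {label: set(meanings) for label, meanings in groups.items()}

print(len(groups))                        # 1: a single label
print(len(distinct_meanings["#jaguar"]))  # 2: two meanings behind it
```

In real folksonomies the second column does not exist. The intended meaning is never recorded, so nothing in the data reveals that the group is mixed.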

Humans have trouble when talking about concepts, because the same word or phrase may imply different things to different people, what’s known as polysemy.  Nearly everyone has their own idea about what love means, even while nearly everyone is happy to use this four-character string.  Identity (labels or IDs) is not the same as semantics (shared meaning).  In natural language conversation, ambiguity is a recognized problem and to some extent expected, with clarifying questions a common countermeasure.  Suppose a speaker is talking about “personalization.”  The listener may wonder: what does he mean?  The listener may supply her own definition, which may be the same as the speaker’s, or slightly different — or vastly different.  The listener may use a different name to refer to what the speaker considers “personalization” — to the listener, the speaker is talking about “customization.”  It’s also possible that the speaker and listener agree about the definition of personalization, but still conceptualize it differently: one sees it as a specific kind of algorithm while the other considers it a specific kind of online behavior.  Even shared definitions don’t imply shared mental models, which involve assumptions outside of simple definitions.  

 Concepts, whether familiar or technical, often lack precise boundaries or universal definitions.  Humans often encounter others using the same terminology to refer to a concept, only to find that others don’t embrace the same meaning or perspective about it.  

Much of what passes for semantics in the computer realm is more about agreed identifiers than agreed meaning.  A “thing” that can be precisely identified can have multiple identities — depending on who is perceiving.  Machines are unable to distinguish the role of denotation from connotation.  Consider an emoji of a facial expression, which has a specific Unicode ID.  The Unicode ID denotes a visual face of some sort.  But the meaning of that facial expression emoji is subject to interpretation.  A single Unicode character can generate polysemy — multiple interpretations.  A Globally Unique ID can spark false confidence that everyone will understand the entity in the same way.
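
The emoji case can be checked with Python’s standard `unicodedata` module. The sketch below looks up one face emoji; the `readings` dictionary is a made-up illustration of how recipients might interpret it, not something the Unicode standard records.

```python
import unicodedata

emoji = "\U0001F60F"

# The identifier: a code point that every system agrees on.
code_point = f"U+{ord(emoji):04X}"

# The denotation: the short name the Unicode standard assigns to it.
official_name = unicodedata.name(emoji)

print(code_point, official_name)  # U+1F60F SMIRKING FACE

# The connotation: hypothetical readings by different people.
# Nothing in the character data captures these.
readings = {"sender": "playful", "recipient": "smug"}
```

The identifier and the name are stable and shared. The interpretation lives entirely outside the data.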

The promise and peril of IDs are that they act as stand-ins for a bunch of statements.  In natural language, we’d call them loaded phrases: they trigger all kinds of associations. Is the terminology or labeling used for a concept changing, or is the identity of the concept changing?   But maybe the ID isn’t precise?  We tend to describe things, and how we describe things tends to define them.  Maybe the descriptive properties aren’t relevant to what’s being said.  Maybe they aren’t useful at all.  

Shades of truth

Data are about indirect experience.  They provide detached information that’s been recorded by someone or something else.  Data are generally treated as “facts” — they’re considered objective.  Content doesn’t represent itself so absolutely.  It speaks about the author or reader’s direct experience — personally acquired or understood information.  Content is about expression, which involves individual interpretation.  Notionally, data is free of interpretation, while content supplies it.  

As mentioned earlier, data is exchanged between systems and so needs to be defined with enough precision to allow that to happen.  The value of data (in theory at least) is independent of its source.  Its utility and accuracy are presumed so where the data comes from is less a concern.  Do we care who publishes a list of US presidents or where Google gets its answers presented in its knowledge panel box?  If we believe data is intrinsically objective — that data isn’t false or misleading — we don’t.  

Content, subjective and embodying an editorial point of view, is a reflection of its publishing source.  Audiences evaluate content partly by the reputation of the source. They evaluate content not just for its factual accuracy but its completeness, fairness, insight, transparency, and other qualities.   People value content for being unique.  Content can’t always be easily measured or compared directly against other content.  People rely on their judgment rather than some calculated comparison.  

The presumption of knowledge graphs is that KNOWLEDGE (yes, it’s a big shouty idea) can be reduced to data.  A less glamorous name for knowledge graphs is linked data, the term that Tim Berners-Lee championed two decades ago. The terms knowledge graphs and linked data are nearly synonymous, the main difference being that knowledge graphs involve a curated set of data (curation = editorial choices).  Because knowledge graphs are promoted as the answer to nearly every problem and possibility facing humankind, it’s important to be clear about what knowledge graphs can and can’t do.  Knowledge graphs advance our ability to connect different facts and bring transparency to many domains. They are especially useful in supporting internally-focused enterprise use cases where experts share data with their colleagues, who can interpret its meaning and significance based on a shared understanding. Knowledge graphs are indeed useful in many scenarios.  But we must accept that the “knowledge” aspect of the term is a bit of marketing hyperbole, coined by Google.  Data, by itself, no matter how elaborately connected and explicated, does not generate knowledge, because knowledge requires explanation, while data can only show self-evident things.  Content, in contrast, is about presenting information that’s not self-evident to audiences.  Content explains facts.  Knowledge graphs link facts.  

Many possibilities are available to connect items of data together especially when joining tables or federating search across different sources.  But the output of data queries is still limited in what it can express.  Data are related through operators, such as equals (is), comparisons such as more than or before, and inclusion (has).  That’s a small set of expressions compared to natural language.  Semantic data is more expressive than ordinary relational data because its properties function like verbs in natural language.  But the constraint remains. For data to be manageable, the properties must be enumerated into a limited list of verbs.  Nuance is not a forte of data.  
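
One way to picture the constraint is a closed list of relations. The sketch below is an illustration only: the `Predicate` enum and `holds` function are hypothetical names, and the four members simply mirror the operators mentioned above.

```python
from enum import Enum


class Predicate(Enum):
    """A deliberately closed list of relations (hypothetical)."""
    IS = "is"
    MORE_THAN = "more than"
    BEFORE = "before"
    HAS = "has"


def holds(left, predicate: Predicate, right) -> bool:
    """Evaluate one statement using only the enumerated relations."""
    if predicate is Predicate.IS:
        return left == right
    if predicate is Predicate.MORE_THAN:
        return left > right
    if predicate is Predicate.BEFORE:
        return left < right
    if predicate is Predicate.HAS:
        return right in left
    raise ValueError(f"unsupported predicate: {predicate}")


print(holds(5, Predicate.MORE_THAN, 3))                  # True
print(holds({"garlic", "onion"}, Predicate.HAS, "onion"))  # True

# A relation outside the list cannot even be expressed:
try:
    Predicate("vaguely reminds me of")
except ValueError:
    print("no such predicate")
```

Anything a writer could say with a fresh verb has to be squeezed into one of the listed members or left out.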

Data’s relevance challenge

How does data establish its relevance to audiences?  Unlike content,  data was never created with a specific audience need in mind.  So how can it become relevant to people?  

Content APIs are designed around basic questions: what content will be relevant to audiences?  The API delivers specific content that provides the answer.  The body of content will be relevant to different audiences for different reasons, though no single individual will necessarily be interested in all of it. The content was created to answer specific questions.  The API’s task is to determine which slice within the content body is needed for a specific context at a specific time.  Answer-responses can be programmatically delivered using a flexible range of simple declarative questions.

Data, being more open-ended in how it can be queried, has more difficulty providing relevance reliably.  Data can answer straightforward questions easily: what’s the cheapest gadget that’s in stock, or the highest-rated gadget?  While the result doesn’t provide much explanation, the answers can point audiences to content items they may want to view to understand more.
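
The two gadget questions can be sketched in a few lines. The catalog, field names, and values below are invented for illustration.

```python
# A hypothetical product catalog.
gadgets = [
    {"name": "Gadget A", "price": 25.0, "rating": 4.1, "in_stock": True},
    {"name": "Gadget B", "price": 19.0, "rating": 4.7, "in_stock": False},
    {"name": "Gadget C", "price": 32.0, "rating": 4.8, "in_stock": True},
]

# "What's the cheapest gadget that's in stock?"
cheapest_in_stock = min(
    (g for g in gadgets if g["in_stock"]),
    key=lambda g: g["price"],
)

# "What's the highest-rated gadget?"
highest_rated = max(gadgets, key=lambda g: g["rating"])

print(cheapest_in_stock["name"])  # Gadget A
print(highest_rated["name"])      # Gadget C
```

Each answer is a single record. Why Gadget C earns its rating, or whether Gadget A suits a particular buyer, is left to the content those records point to.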

Knowledge graphs are supposed to turn general-purpose data into something that will be understandable and relevant to the casual inquirer. They’re able to do that when the user understands the domain already, but even then the relevance of results is uncertain.  Using knowledge graphs to extract valuable insights from data can be a deeply labor-intensive exercise  — a treasure hunt in an age when consumers expect systems will automatically tell them what they need to know.

Knowledge graphs are generally built from graph databases, which connect different types of entities and reveal their indirect relationships.  These databases have been difficult to translate into audience-facing applications.  People don’t know what to ask when dealing with indirect relationships: what relationships are potentially valuable to explore.  For this reason, we see few consumer applications using SPARQL, the most commonly used specialized query language for knowledge graphs.  

Instead, most consumer recommendation engines operate using a different sort of graph database called property graphs that are based on self-defined semantics rather than universally agreed ones.  Property graphs are optimized to find the shortest path or distance between two objects.    The goal of the recommendation is to highlight the strength of the connection rather than providing a way for the customer to drill into the multifaceted relationships.  Recommendation engines are fundamentally different in purpose from faceted filtering.  They remove complexity from the user, rather than expose users to it.
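
As a rough illustration of the shortest-path idea, here is a breadth-first search over a tiny hand-made graph. It is not how any particular graph database implements the search; the node names, the `BOUGHT` label, and the helper functions are all hypothetical.

```python
from collections import deque

# A hypothetical property graph: nodes carry properties, and edges
# carry a label the publisher defined for itself.
nodes = {
    "alice": {"kind": "customer"},
    "bob": {"kind": "customer"},
    "tent": {"kind": "product"},
    "stove": {"kind": "product"},
}
edges = [
    ("alice", "BOUGHT", "tent"),
    ("bob", "BOUGHT", "tent"),
    ("bob", "BOUGHT", "stove"),
]


def neighbors(node):
    """Yield nodes one hop away, ignoring edge direction."""
    for src, _label, dst in edges:
        if src == node:
            yield dst
        elif dst == node:
            yield src


def shortest_path(start, goal):
    """Breadth-first search; returns the first (shortest) path found."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = shortest_path("alice", "stove")
print(path)           # ['alice', 'tent', 'bob', 'stove']
print(len(path) - 1)  # 3 hops
```

A recommendation built this way would surface the stove to Alice because the connection is short. It would not show her the path or ask her to explore it.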

The goal of using semantic data to “reason” — to draw non-explicit conclusions — hasn’t materialized in mainstream consumer applications.  Most semantic data is used to answer explicit questions, albeit sometimes highly complex ones.  One benefit of semantic data over traditional data is that the questions asked can be more elaborate because each item of data ultimately is relatable to other data.  The cost of this benefit is high: the answer to any question may be null — because it’s not obvious what questions yield useful answers.    The general public has limited interest in chaining together elaborate queries. They tend to ask straightforward questions.  

When questions become complex and topics are opaquely enmeshed, systems can’t expect audiences to articulate the question: they need to anticipate what’s needed.  While experts are interested in how an answer was arrived at, ordinary users on the web are more interested in getting the answer.  They want to be able to rely on curated queries that ask interesting questions that have meaningful answers.  

Richness and ambiguity in content and data

Whether content or data provides richer expression depends on many factors.  Both can be vague.  Using many words won’t necessarily make things clear.  And presenting many facts doesn’t tell you everything you need to know.

One measure of the value of a resource is its succinctness. A succinct resource can convey much.  Alternatively, it may flatten out important nuances.  Data’s value is associated with its  facticity, where facts speak for themselves and no one entertains deviant ideas about what those facts mean.  Content’s value isn’t about crystalized facts or named entities.  While content is often reduced to being about words, it’s actually about wording — interpretation, understanding, and feeling.  

Data provides confident answers to narrowly defined questions.  Content provides richer but more uncertain answers to broadly-defined ones.  

The limits of decomposition

People learn a lot by focusing on the core facts — nuggets that can be translated into data — within statements in content.  But that focus also carries a risk: the attractive but mistaken idea that all content can be boiled down to data.  People often want direct answers, which we expect can be filtered from facts.  But answers also often involve interpretations, which go beyond agreed facts. Our interpretations can diverge, even if  everyone agrees with what the facts are.

Consider a common, simple question: who is a movie for?  The content of a film can be reduced to a maturity rating. In principle, these are based on clear criteria, but in practice, they can be difficult to apply unambiguously and consistently.  Classification is often based on patterns rather than criteria satisfaction.  And the maturity rating still doesn’t tell us much about who the movie is for, even if we agree the rating is accurate.  

According to the ideas of semantic data, almost anything can be defined in terms of its properties.  An entity should be explainable through its data values.  In practice, such descriptions only work for simple entities with regularized parameters, generally human-made things or abstract archetypes.  Complex things that have evolved organically over time, whether products, living beings, or ideas, can’t be easily reduced to a handful of data parameters.  They defy simple data-explication, because of:

  • Tacit qualities that can’t be articulated easily  
  • Complex attribute interactions, including irregular combinations and exceptions
  • The overloading of dimensions, where there are too many dimensions to track easily

The mundane act of identifying a bird species illustrates the issue. For many related species, it can be hard to do, as each bird species can have many attributes that vary, and many bird species have similar attributes that generate confusion in identification.  Birders rely on evaluating the “jizz” of birds: the overall impression a bird gives through its shape, posture, and movement. Many categories we rely on are concepts that have general tendencies rather than absolute boundaries.  For this reason, criteria-based definitions aren’t satisfactory.  

Data is also not good at representing many processes and how they can change the state of things.  Consider the many ways an onion can be processed, resulting in different outcomes with different property values.  The cut of onions (diagonal, minced) and cooking method (steamed, sautéed, stir-fried, deep fried) influence texture, sweetness, and so on.  The input properties influence the contours of the output properties but don’t dictate them.  There are too many subtle variables to model accurately in a data model.  Content is more efficient at discussing these nuances.

Food is born from recipes: a structure of inputs yielding an output.  It’s become a stock metaphor for illustrating how content or data works.  Those who believe “content is data” love to cite the example of food recipes.  Recipes are structured, they follow a standard convention, and deal with entities and quantities.  But contrary to what most people think, recipes aren’t really data.  They are content.  They are full of nuance.

A recent semantic data research paper explored how knowledge graphs could allow substitutions in recipes.  The authors, at IBM and the Rensselaer Polytechnic Institute, explored the potential of a “knowledge graph of food.”  They had the goal of creating a database that could allow people to switch out a recipe ingredient to use something more healthful. The authors figured ingredients can be quantified according to their nutritional values: vitamins, calories, fat content, etc.  “We use linked semantic information about ingredients to develop a substitutability heuristic for automatically ranking plausible ingredient substitutions.” While the goal is laudable, the project in many ways was naive.  The authors hoped that a simple substitution would be possible with no second-order effects.  But ingredients interact in complex ways, much like the details of content do. If you want to replace cream in a recipe, there are many options with different tradeoffs. Suppose we think about recipes as formulas of ingredients.  When one ingredient changes, it changes the formula and the outcome.  The problem is that food isn’t only about chemical properties.  It’s about experiential properties.  People react to food according to its sensory qualities: taste foremost, but also texture, aroma, and appearance.  The authors acknowledge that “the quality of an ingredient substitution can be subjective and difficult to concretely determine.”  But they don’t dwell on that because it’s a problem they can’t solve.  Any cookbook writer will tell readers that substitutions are possible, but they will change the character of the dish and may necessitate other ingredient changes and compensating measures.  
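
To show what a purely property-based ranking leaves out, here is a toy sketch. It is not the heuristic from the paper; the function names are hypothetical and the numbers are placeholders, not real nutritional data.

```python
# Placeholder values for illustration only; not real nutritional data.
original = {"name": "cream", "fat": 30.0, "calories": 300.0}
candidates = [
    {"name": "substitute_a", "fat": 10.0, "calories": 150.0},
    {"name": "substitute_b", "fat": 2.0, "calories": 60.0},
    {"name": "substitute_c", "fat": 35.0, "calories": 320.0},
]


def healthier_than(candidate, reference):
    """True when the candidate is lower in both measured properties."""
    return (
        candidate["fat"] < reference["fat"]
        and candidate["calories"] < reference["calories"]
    )


def rank_substitutes(reference, options):
    """Keep the healthier options and sort them, leanest first."""
    viable = [c for c in options if healthier_than(c, reference)]
    return sorted(viable, key=lambda c: (c["fat"], c["calories"]))


ranking = [c["name"] for c in rank_substitutes(original, candidates)]
print(ranking)  # ['substitute_b', 'substitute_a']
```

The ranking is tidy because it only sees the columns it was given. Whether the top choice still whips, thickens a sauce, or tastes right is not in the table, which is the gap a cookbook writer fills with a paragraph of advice.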

When resources discuss things that are important priorities to people, they will address numerous factors they care about.   The whole will be more complex than the sum of its parts.  

Content and Data as Equals

People rely on both content and data.  It’s a mistake to view one as superior to the other.  

From the perspective of user experience, people relate to events in the world on three levels:

  1. Phenomena that we perceive or notice: the content of our experience
  2. How we identify, classify, and characterize those phenomena: the facts or data we ascribe to what we’ve encountered
  3. How we explain and evaluate the phenomena: the content of our personal explanations

Our perceptions translate into properties of an entity.   Our characterizations translate into its values.  Our explanations are complex and open-ended, like content.

Data can’t replace content.  A knowledge graph query will tell you very little about the state of knowledge graph research.  You’ll need to read PDFs of conference papers to learn that.

And modern digital content can’t be viable without drawing on data to explain entities in concrete detail.  Data can highlight what’s available within the content and clarify its details.  Data enhances content.

Standards are valuable when many different people need to do the exact same thing. A basic philosophical disagreement concerns how similar or dissimilar people’s needs are.  On one extreme is the tendency to consider every resource as unique, with no reusability. That’s the old model of web content, where data was a second-class citizen.  On the other extreme is the belief that everyone needs the exact same resources and that these can be described with standardized structure.  People cease being unique.  Content is a second-class citizen.

Content needs to move forward to develop internal publishing standards for structure: internal schemas that allow coherent assembly of different parts into meaningful wholes. That improves the range of combinations that content can present to be able to address the needs of different individuals. But the standardization of content structure can only go so far before it gets homogenized and loses its interest and value to audiences.  It’s not realistic to expect content publishers to adopt external schemas for content: they will lose their editorial voice if they do so.  

For data to increase its utility, it will need to start considering its editorial dimensions, especially who the data will be for and what those people need. As public wariness toward data collection by big institutions builds, ordinary people need to feel data can serve their needs and not just the needs of experts or powerful corporations.  Data curators are needed to look at how data can empower audiences.   

Data and content both have structures, which allow them to be managed in similar technical ways. Both content and data can be queried with APIs, and APIs can combine both content and data from different sources.  But despite that commonality, they have different purposes.  The goal should be to make them work together where it makes sense to, but not to expect one to be hostage to the other.  

The technical possibilities for transforming content and data have never been greater.  It can be easy to become dazzled by these possibilities and idealize how beneficial they will be for people.  Too little of the discussion so far has focused on what ordinary people need. Experts need to engage more with non-experts to learn what makes digital resources truly meaningful.  

— Michael Andrews