Categories
Content Engineering

Making it easier to build structured content


A quick note about an amazing UI from Writefull called Sentence Palette that shows how to combine structured content with prompts to write in a structured way. It nicely fleshes out the granular structure of a complex document down to the sentence level. I see broad application for this kind of approach for many kinds of content.

The deep analysis of large content corpora reveals common patterns not just in the conceptual organization of topics but also in the narrative patterns used to discuss them.

Writefull has assessed the academic writing domain and uncovered common text patterns used, for example, in article abstracts. LLMs can draw upon these text fragments to help authors develop new content. This opens up new possibilities for the development of structured content.

— Michael Andrews


Revisiting the difference between content and data

This post looks in detail at the differences between content and data.  

TL;DR: I understand you’re too busy to read a 10,000+ word post.  No worries, I’ve pulled out some highlights in the table below.  If this data doesn’t answer your questions, you may have to read further.

Content | Data
Open-domain | Closed domain
No restrictions on expression | Expression is restricted
Has an intent | Doesn’t have an intent
Has an author | Often anonymous
Complex values | Single, unambiguous values
Topics and narratives | Entities
Structure builds resources | Structure defines boundaries
Facts discussed outside of pre-defined relationships | Facts described through pre-defined relationships
Assembly structure is coupled | Assembly structure is decoupled
Editorial composition | Logical composition
Has a specific audience | Is universally relevant
Nuanced in meaning | Meaning is standardized
Focused on audience interest | Focused on resource reusability
Meaning can be independent of context | Meaning is dependent on context
Build up meaning | Break down meaning
The bigger theme | The details
Often proprietary | Often public domain
Uniqueness is valued | Regularity is valued
Cares about audience attention | Doesn’t care about audience attention

Some differences between content and data

What’s the issue and why does it matter?

New approaches to managing content and data such as headless content management and knowledge graphs are altering how we work with digital resources on a technical level.  Those developments have made it even more important to address what humans need when it comes to content and data, especially since the needs of people so far have not been at the forefront of these technological developments.  Too often, the resource is considered more valuable than the people who might use the resource.  It’s time to define future approaches to accessing content and data in a more human-centered way.   The first step is to be clear about the differences between how people use content and data.    

While the difference between content and data may seem like an idle philosophical question, it goes to the heart of how we conceptualize and imagine our use of resources online. Probing the distinction allows us to examine our sometimes unconscious assumptions about the value of different resources.  Content and data are basic building blocks for communication and understanding.  Yet I’ve been surprised how differently professionals think about their respective value, potential, and limitations.  My thinking about this topic has evolved as well, as I watch changes in the possibilities for working with both but also encounter sometimes idealized notions about what they can accomplish.  

Our mental model of digital resources influences how we work with and value them.  Many people consider content and data as different things, even if they can’t delineate precisely how they differ.  They often have different perspectives on the value of each.  Some popular tropes illustrate this.  Data is the “new oil”: the value of content is to generate or extract data.  Content is “king” (or queen): data exists to support content.  In both these perspectives, content or data are seen as raw material, a means to an end, though they disagree on what the end is.

When we talk about content and data, we rarely define what these terms mean. This situation has been true since the early days of the web.  We’ve made little progress in understanding what various digital resources mean to people and the picture keeps getting more complex.  Content and data are becoming more intertwined, but they aren’t necessarily converging.  

On a technical level, various mental models people have about resources get translated into formal schemas.  How we architect resources influences how people can use them. 

Are content and data different categories of resources?

Professionals who work with digital resources in different roles don’t share the same understanding of how content relates to data.  Until recently that wasn’t a big problem, because content and data lived in separate silos.  People who worked with content could comfortably ignore the details of data, and those with a data focus were indifferent to content.  

Outside of those who create and manage content, content is still largely ignored in the IT world.  There’s always been more of a focus on information or knowledge.  When you avoid discussing content and understanding its role, you can slip into a shaky discussion about delivering information or providing “knowledge” to users without considering what audiences actually need.  

But the historical silos between content and data are slowly coming down, and it’s becoming more important to understand how content and data are related.  Unfortunately, it’s not so simple to express their differences succinctly, because they vary in many different dimensions.  They aren’t just words with simple definitions, but complex concepts.

Everyday definitions of content and data don’t help us much in locating critical distinctions.  For example, Princeton University’s Wordnet lexical database provides the following short definitions of each:

  • Content: message, subject matter, substance (what a communication that is about something is about)
  • Data: information (a collection of facts from which conclusions may be drawn) “statistical data”

These definitions hint at differences, but also areas of overlap.  The distinction between a “message” and “information” is not obvious.  In everyday speech, we might say “Did you receive the message (alternatively: information) today?”  Similarly, the ideas of “substance” and “facts” seem similar.  It’s tempting to dismiss these problems as the byproduct of sloppy definitions, but I believe many of the difficulties stem from the complex and changing essence of these concepts.  

When concepts are difficult to define, many people look for examples to show what something means.  But canonical examples of familiar resources don’t help us much.  Let’s consider two ink-and-paper products that can be purchased from the US Government Printing Office.  The US Census is a canonical example of data: a compilation of cold, statistical facts.  The US Constitution is a canonical example of content — a series of statements that are rich with meaning — so much so that people passionately debate what every word means. Canonical examples can provide concrete illustrations of concepts, but most of the resources we deal with will not be so well defined.  

In the past, we categorized resources according to end user: data was for machines to process and calculate, while content was for humans to understand. Or we conceptualized content and data by the environments in which we encountered them.  For example, we viewed data as rows and columns of text and numbers in a spreadsheet or relational database.  While these could be presented to readers in a table, the rawness of the source material did not make it seem like content.  As computers began to process all our resources, the picture became more complicated.  Computers could store records, such as my university transcript, which provided an overview of my studies: a story of a sort.  Computers also stored documents.  I began using computers to create documents while in university, even though the delivered document was on paper.  Was the file on that floppy disk content or data? Conversely, I used card stock paper to create data for computers by filling in the dots on a computer punch card, manually processing data for the benefit of a computer.

Based on past experience, people are often inclined to see content and data as different.  It’s also worth considering how they might be related. Earlier this month, a developer I know posted on a content strategy forum that “content is data.” Professionals working with digital resources may view content and data as having a range of relationships:

  1. They are identical (or any avowed distinctions are inconsequential)
  2. They are separate and independent from each other, with no overlap
  3. They overlap in some aspects (either sharing common properties or representing a continuum)
  4. One is a subset of the other (content is a kind of data or data is a kind of content)

I’ve encountered people who promote any one of these views, among others.  It’s even possible to see data or content as an expendable resource with no inherent value.  For researchers in natural language processing, content is merely a “data set”: a very long string of characters to bend into shape to support different use cases.

My own view is that content and data are fundamentally different, with only limited overlap. They coexist and complement each other, but they are distinct kinds of resources. In one specific dimension, they are becoming more alike: how they are stored and can be managed.  But they are very different in two other areas: how they are generated and created, and how they are consumed and used.

When viewed solely through the lens of technology, content and data can seem to resemble each other.  Both can be structured by models that are similar in form. Much of what makes content and data different is invisible to technology: they have different purposes, and people relate to them in different ways.  A growing source of confusion arises when individuals assume that content and data should behave the same way because they have similarities in form.  But morphological similarities are not the full story, just as the wings of pigeons, penguins, and ostriches look similar but play different roles.  

How content and data are stored and managed

Digital resources have a material presence. They take up space.  I live only a few miles from where one of the largest concentrations of server farms in the US is hosted and am keenly aware of the land and energy they need.  What’s lurking there?  What are these resources talking about?

Many developers see no intrinsic difference between data and content: both are simply digital objects with IDs.  Different models of storage — branching code repositories, file structures, XML encodings, graph databases, schema-less databases — are simply alternative ways to organize bytes of data.   Some developers refer to content as unstructured data.  According to that perspective, content is a form of data, but in a less perfect form.  The best that content can aspire to is to become “semistructured” data.

Developers encounter the terms content and data in jargon referring to the format of the resource they are dealing with.  They deal with “content types” (for HTTP requests) and “data types” (for data values stored or retrieved).  For example, text can be plain or HTML.  In this sense, content and data aren’t too different: they define a discrete payload.  Small wonder developers don’t spend much time pondering the distinctions.

Within a content management system, the term “content types” appears again but in a different sense: they define the fields for an item of content, and each field needs a data type to indicate the kind of value used.  Here, content types are made from data types and might be considered a superset of data types.
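To make the relationship concrete, here is a minimal sketch of a content type assembled from data types. It is not taken from any particular CMS: the names `Field`, `ContentType`, `DATA_TYPES`, and the `article` type are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical data types a field may use, mapped to Python types.
DATA_TYPES = {"short_text": str, "rich_text": str, "integer": int}


@dataclass(frozen=True)
class Field:
    name: str
    data_type: str  # must be a key of DATA_TYPES


@dataclass(frozen=True)
class ContentType:
    name: str
    fields: Tuple[Field, ...]

    def validate(self, item: dict) -> list:
        """Return a list of problems; an empty list means the item fits the type."""
        problems = []
        for field in self.fields:
            if field.name not in item:
                problems.append(f"missing field: {field.name}")
            elif not isinstance(item[field.name], DATA_TYPES[field.data_type]):
                problems.append(f"wrong data type for: {field.name}")
        return problems


# An illustrative content type: named fields, each with a data type.
article = ContentType(
    "article",
    (
        Field("title", "short_text"),
        Field("body", "rich_text"),
        Field("word_count", "integer"),
    ),
)
```

Note what the sketch checks and what it doesn’t: it can confirm that `body` holds text, but it has nothing to say about what that text discusses.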

In the CMS-specific definition of content types, we see signs of semantics, which break the resource into parts that have names.  The semantics help describe what the resource is about, not just how to process it as a format.  Provided a developer knows the structure of the resource, they can query it to obtain specific elements within the resource to answer questions.

What do the parts of a resource offer and how can they be used?  These questions are often answered in API documentation, which explains the model of the resource.  Most serious CMSs now have a content API; better ones expose their entire content model. This is having radical consequences.  Content is not tied to any specific display destination such as a website, but instead becomes a dynamic resource.  With GraphQL, a fast-growing API query standard, it’s become easy to combine content from different sources.  Content publishing has the potential to become multisource: distributed and federated.  Content can be exchanged between different sources, which don’t need to have precisely the same understanding of what the originating source had in mind.  APIs are like a phrasebook that translates between different parties.  People don’t need to speak the identical language (schema) provided that they have the phrasebook.  This is a major shift from the past when the presumption had been that everyone needed to agree to a common schema to share their resources.
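The phrasebook idea can be sketched in a few lines. This is an illustration with plain dictionaries, not GraphQL or any real content API, and the field names (`headline`, `byline`, and so on) are invented for the example.

```python
# Hypothetical mapping between the field names of two publishers
# that never agreed on a common schema.
PHRASEBOOK = {
    "headline": "title",
    "standfirst": "summary",
    "byline": "author",
}


def translate(item: dict, phrasebook: dict) -> dict:
    """Rename the fields the phrasebook knows; pass the rest through unchanged."""
    return {phrasebook.get(key, key): value for key, value in item.items()}


source_item = {
    "headline": "Content is not data",
    "byline": "A. Writer",
    "tags": ["content"],
}
translated = translate(source_item, PHRASEBOOK)
```

Neither party changes its own schema. The mapping does the translating, and fields it doesn’t recognize pass through untouched.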

Non-technical users gain access to the model of a resource by using a UI that’s connected to it.  With the shift to the cloud, consumers are increasingly unconcerned about where resources are stored.  Many resources are accessed through apps.  Some browsers now don’t display full URLs.  The address of the content — its path and location — is disappearing and along with it a concern about structure.  While the cloud is just a metaphor, it does capture the reality that resources no longer have a fixed address.  Storage containerization means resources move around.  From the consumer’s perspective, the structure is becoming invisible — seamless.  They don’t see or care about how the sausage is made.  They only care what it tastes like.

In terms of access and storage, content and data are becoming more alike.  With APIs, content is becoming more malleable, like data.  And data is getting upgraded with more semantics, becoming more descriptive and content-like. These changes are largely invisible to the consumer, until they don’t work out.  People do notice when things are askew, though they are unsure why they are.  They care not only whether the files can be accessed, but also how coherent the experience is of getting that stream of resources.

Generating and creating resources: expression

What can a digital resource talk about?  Content and data are often discussing different concerns.  Content talks about topics or stories.  Data talks about entities.  While similar in focus, content and data have different expressive potentials.

Data and content have different perspectives about: 

  • What they can mention: the properties that are discussed
  • What can be said about those things: the values presented

Restricting expression

The structuring of resources will influence what resources can discuss.  And they can impose rules on values (controlling values, validating values, or restricting values), which will further limit what can be said.  Structure can limit expression. 

Content is “open domain”: it can talk about anything.  The author decides what topic or story to talk about; they are not restricted to a pre-determined universe of topics.  Once that story or topic is chosen, they can address any aspect they want to, and there are no restrictions on the attributes of the topic.  And they can say anything they want to about it; there are no restrictions on the values of those attributes.

Content, in its untamed form, is open-ended in how it discusses things. Content doesn’t require a pre-defined structure, though it’s possible to structure content to define properties that shape what dimensions get discussed.  These may be specific fields or broader ones.  Even when content is defined into structural elements, the range of values associated with these elements is open-ended (such as free text, video, or photos) and the values can discuss anything without restriction.  Restrictions on content are few: there may be a character limit on a text field or a file size limit on an image.  A few fields will only accept controlled values.  But in general, most content is composed of values created by an author.

 Data is “closed domain”: to be useful, data must be defined with a formal schema of some kind — a set of rules.  This limits what the data can talk about to what the data schema has defined already.  
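As a rough sketch of what “closed domain” means in practice, assume a hypothetical schema that lists the only properties a record may mention and the only values each may hold. The names `PRODUCT_SCHEMA` and `conforms` are invented for this example.

```python
# A hypothetical schema: the permitted properties and,
# for each property, the permitted values.
PRODUCT_SCHEMA = {
    "color": {"red", "green", "blue"},
    "size": {"S", "M", "L"},
}


def conforms(record: dict, schema: dict) -> bool:
    """True only if every property is in the schema and every value is allowed."""
    return all(
        key in schema and value in schema[key]
        for key, value in record.items()
    )
```

A record with a `color` of `"teal"` or a property called `"mood"` is rejected: the schema has no way to talk about either. An author writing free text faces no such boundary.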

Digital resources thus vary in their structure and allowed values.  Structure is not inherently good or bad — it involves a range of tradeoffs.  What’s best depends on the intent of the contributor.  We need to consider the goal of the resource.  

With data, there’s no obvious intent.  We don’t know why the data is there, or how it is supposed to be used.  But the presumption is that others will use the data, and to make the data useful, it needs to conform to certain standards relating to its structure and allowed values.

With content, the situation is different.  All content is created with an intent in mind. Sometimes that intent is vague or poorly thought out.  But content generally takes effort to create, which means there must be a motivation behind why it exists.  And by looking at the content, we can often infer its intent.  We know there’s an audience who is expected to view the content and we can make some guesses about what that audience is expecting.  The audience’s needs define both the structure and the values used. 

While data doesn’t have an audience, content does. Data doesn’t have an author, while content does.  These distinctions have implications for how resources can be used.  

Representation and ‘aboutness’ in digital resources

What’s the resource about and what’s it trying to explain?  Again, content and data focus on different aspects.

Content and data differ in what each can represent.  Content can discuss a set of facts that don’t have a predefined relationship.  Authors make decisions about what to include and how to talk about them. Content doesn’t need to be routine in what it expresses. Data is meant to convey a predefined range of facts in a precise way.  Data depends on being routine.    

Both content and data make statements, but the values for these statements are dissimilar.  Content elements hold complex values (the value may contain several ideas.)  Data elements hold simple values (normally one idea or ID per value.)

Content deals with topics and stories that ultimately are about themes, ideas, concepts, life events, and other kinds of things that are open to interpretation.  Data describes entities — concrete things in the real world or human-defined records such as invoices.  Data can only describe properties of things that can be measured according to recognized values.  Its role is to provide a consistent understanding of an entity.

A fundamental difference, then, is the approach that each uses to describe things.  Content describes these with natural language, pictures, or other forms of human communication. Data describes them in terms of their measurable properties, or their relationships to other entities.  

Data values are simple and ideally unambiguous: names, IDs, pre-decided labels, quantities, or dates.  Content is different from those simple values.  A content value can’t be easily evaluated by machines because it contains complex, multipart statements about multiple entities.  Lots of work goes into making natural language understandable to computers — finding the themes and sentiments, or recognizing when entities are mentioned.  Despite impressive progress, machines have trouble interpreting human communication.  That machines don’t find human communication reliable does not mean it’s less accurate.  Content can be both more nuanced and more compact than data.  Content may seem less precise, but it also can represent concepts and statements that data representations can’t hope to.  Data engineers are having difficulty representing even the basic features of laws and regulations, for example.

Data breaks down facts into individual statements.  By doing so, data can manage to be both specific and incomplete.  The building blocks of data allow us to say a lot about entities.  We can map the relationships between various entities, including people.  Data can tell about a couple who marry and later divorce.  But it can’t tell us why they married or why they divorced.  The enumerated values within data models can’t address complex explanations.
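The marriage example can be sketched as a handful of discrete statements. The names, years, and predicates below are invented, and the tuple layout is only one assumed way to store such facts.

```python
# Each fact is a (subject, predicate, object, year) statement.
# All names and years are made up for illustration.
FACTS = [
    ("Alex", "married", "Sam", 2010),
    ("Alex", "divorced", "Sam", 2018),
]


def query(facts, subject=None, predicate=None):
    """Return the facts matching the given subject and/or predicate."""
    return [
        fact
        for fact in facts
        if (subject is None or fact[0] == subject)
        and (predicate is None or fact[1] == predicate)
    ]


marriages = query(FACTS, predicate="married")
reasons = query(FACTS, predicate="reason_for_divorce")  # nothing to find
```

The data answers who and when precisely. Asking why returns an empty list, because no enumerated predicate captures the explanation.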

Data’s fundamental purpose is to make discrete, unambiguous statements about an entity.  Data will hypostatize or reify an object. The object becomes an entify-able value: its existence is stripped down into its observable and quantifiable qualities. Once an entity is converted into values, it can be evaluated.  Even actions can be treated as entities, provided they can be enumerated into categories that are fixed in meaning.   This restriction limits the data-ification of the descriptions of processes since actions can involve so much variation.  

The ability to make discrete, unambiguous statements depends on having an agreed schema to discuss an object’s properties.  Data schemas can either be opinionated or not opinionated (an opinion being a point of view that’s not universally accepted.)   The semantics (what things mean) may involve coerced agreement — much like the terms of service you must click on to use an internet service.  How one asks questions (the syntax) can involve forced agreement as well.  Schemas can contain a range of opinions:

  • Opinionated: Everyone needs to agree to the same schema to talk about what things mean (you must accept my version of the truth.)  
  • Non-opinionated: people can define their own schemas, though they will need to learn about what others have decided if they want to use someone else’s.
  • Opinionated: everyone needs to make requests in the same way (syntax) about a set of facts defined by a schema.  
  • Non-opinionated: how one asks questions can vary and the answers (truth) are language-neutral, independent of how questions are framed. 

A deep irony — and flaw — of many efforts to structure data with a standardized schema is that these initiatives tend to dictate the use of a specific query language to access and utilize the data.  “Openness” is promoted in the name of interoperability but is done by forcing everyone to adopt a particular standard for data or queries — forcing an opinion about what’s correct and acceptable, resulting in pseudo-openness.

A schema is simply a framework that provides a context to what is being discussed.  People use mental schemas to interpret the world, while machines use data schemas to do that.

Content can have meaning independently of its immediate context, provided the content is unambiguous.  A fragment of content can often stand alone, while a fragment of data can’t.  Content doesn’t need a formal schema: it relies on shared knowledge and shared meaning.  The role of structure within content is to amplify meaning beyond the meaning carried by the words or images.  The structure of content frames a scaffolding of meaning.  This represents a big difference in how content and data approach structure.  With content, the emphasis is on using structure to build up meaning: to enlarge ideas.  With data, the emphasis is on using structure to break down meaning: to locate specific details.

Data has meaning only within a specific context. When viewed by humans, data makes sense only as part of a record that shows the context for the values presented.   For machines, data makes sense within the hierarchical or lateral relationships defined by a schema.  The innovation of semantic data is its ability to describe the meaning of data independently of a larger record.  Semantic data creates the possibility to recontextualize data.  

Data can supply precise answers to structured questions. But data does poorly when trying to convey the meaning of concepts — the big ideas that motivate and direct our behavior.  Complex concepts are difficult to express as data.  They can be given identifiers to disambiguate them from similar words that refer to different concepts.  But even a universal ID, such as a Wikidata ID, does not make the concept of “love” clear as to its meaning.  It merely tells us we aren’t talking about a band with such a name.

Data can supply a comprehensive description only when an entity can be defined entirely by data. For example, statistical categories can be defined by data.   More often, data must rely on concepts that can only be defined using content.  Even if different resources use the same term to describe a property (such as “name”), they may define those terms differently.   

Combining details to build explanations

Another aspect of expression concerns what can be presented together to create a larger whole.  Here, we aren’t looking at what can be said by a single contributor at a single point in time, but what can be combined from different sources created at different times.

Both content and data are becoming more connected: able to combine with other similar resources.   The elements within both kinds of resources are being defined more specifically with semantics indicating their meaning or purpose.  This allows larger digital resources to be composed semi-automatically, a sort of ghost authorship.

When resources are broken into elements, they can be combined into various combinations to create new resources.  On what basis are combinations made?  It’s useful to distinguish logical composition from editorial composition.  Data is concerned with logical composition, while content is concerned with editorial composition.  

Web content historically hasn’t been structured into pieces that could be separated. All content intended for publication would be created at the same time by the same author within a single body field.  The elements within the content were tightly coupled, part of a common template.   Authors enjoyed great freedom about what they could address within an item of content but had less freedom in how their content could be combined with other content they or others created.  The presentation of a web page was fixed.

Content can be composed by stitching together statements, a process done through human curation (via links) or programmatically (by filtering and gathering similar items.)  In both cases, humans decide the rules for what kinds of things belong together to compose meaningful experiences.  The difference is that hard-coded programmatic rules are applied routinely rather than once.  

Data has always been about decoupling elements to allow them to be presented in different ways.  What the data can say might be restricted, but how it can be presented can vary in many ways.  Intrinsically, data has the flexibility to combine with other data.  It can either be assembled on a one-time basis from a situational query or be generated from a saved query that fires routinely.

How useful are larger resources that are assembled from smaller elements?  Not all assembled resources are equally useful.

 It is easier to break apart content into data than it is to compose content from data.  Designing a successful binding agent that turns data into content is difficult. Automated journalism can generate prose from data, but the resulting content comes across as pro forma and far from engaging.

A key difference in the structure of content and data relates to intent.  Data structure specifies the meaning of a value.  Content structure will often indicate the intent of an element as well as the meaning it conveys. For example, an author has a name, which is a data value.  The author may have a bio, which is a content value.  The bio is intended to provide context about the writer, and perhaps spark interest in what they have to say.  The bio presents some facts about the author, but its purpose is broader.

Data elements don’t have predefined purposes.  This allows them to be displayed in any number of combinations.  Content elements are less independent in how they can be displayed.  When content presents a topic, patterns of elements tend to be grouped together, and hierarchies of elements address broader and more specific aspects.  

To build an explanation, the elements of a resource need to work together to provide a unified understanding of a larger topic.  Content management has moved from a tightly-coupled structure to a loosely-coupled one, especially with the development of headless content models.  Elements can now be remixed into new presentations.  But audiences must perceive these elements as belonging together.  Loose-coupling is done by relating content types: collections of editorially-compatible elements.   

The simplest form of relations defines whether something is “part of” another or is “kind of” another.  Both these relationships enable aggregation of elements discussing smaller entities into discussions of larger ones. 
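A small sketch shows how these two relations support aggregation. The element names and the `ancestors` helper are hypothetical; each relation is assumed to map an element to a single broader one.

```python
# Hypothetical "part of" and "kind of" relations: narrower -> broader.
PART_OF = {"paragraph": "section", "section": "chapter", "chapter": "book"}
KIND_OF = {"tutorial": "article", "article": "content item"}


def ancestors(element, relation):
    """Follow a relation upward, collecting each broader element in order."""
    chain = []
    # The second condition stops the walk if the relation contains a cycle.
    while element in relation and relation[element] not in chain:
        element = relation[element]
        chain.append(element)
    return chain


paragraph_belongs_to = ancestors("paragraph", PART_OF)
tutorial_is_a = ancestors("tutorial", KIND_OF)
```

Walking the chain upward is what lets a discussion of a small element be gathered into a discussion of a larger one.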

Both content and data are becoming explicit in indicating what individual elements mean. Despite that similarity, they remain different.  People confuse the concepts of structured content and structured data, yet these concepts are different in their orientation.  Both can be building blocks to generate more sophisticated resources.  But the materiality of those blocks is different.  

The structure of content reflects editorial intent.  That intent is often specific to the publisher, which is one reason why standardizations of content structure across publishers have not materialized. The area of technical documentation is a notable exception: it has tried to standardize the expression of content, with mixed results.  Some transactional content is indeed amenable to standardization: cases where people don’t want to think at all because they don’t have an opinion about the material, they trust the advice, and they don’t worry about the consequences of the advice.  API documentation is an example of documentation where the structure used by different publishers is converging, because of the high degree of similarity in its functional purpose.   

But the trend now seems to move away from making content seem like an interchangeable commodity. Content needs to sound human. The rising prominence of intention-focused approaches such as content design and UX writing reflects a growing public wariness toward cookie-cutter content for even “dry” technical and instructional topics.  Audiences don’t trust content that seems formulaic, because it sounds repetitive, even robotic. When all content follows the same limited patterns, it looks the same and people have trouble noticing what is different about each piece.  They question predefined answers that seem too tidy and lack background explanation.  Content needs to sound conversational if it hopes to garner attention, and that’s true even for technical, factually-oriented content.  People hesitate if they feel they are being blinded by details: snowed.  Every detail should seem necessary to the larger purpose of what they are seeing.

Editorial intent is about providing coherence to audiences.  The structure of content needs to support coherence.  It’s the opposite of the data-centric approach of the “mash-up”:  a jarring experience involving a mishmash of items that were never meant to be presented together.  

To assemble content elements successfully into a whole, the pieces need to be designed so they fit together. The pieces should support one another to provide a richer explanation than they would if they were viewed individually.  Generic standards for content elements have never seemed coherent: they have prioritized splitting things apart rather than gluing things together.  Most of them focus on factually heavy content that few people would want to read in total.  It’s possible to combine content from different sources, but the editorial intent of each element needs to fit together. The structure should be seamlessly supporting the larger message.  It shouldn’t be fragmenting the topic to where the relationship among elements is not reinforced.

Public expectations for data are different.  For data to be considered reliable, people expect it to look “regular.”  Consumers don’t want to see an asterisk appended to data, a footnote explaining some nuance.  Data is less trustworthy when posing as solid facts but presented tentatively or teasingly.   It also becomes less trustworthy when it is presented in inconsistent ways.  We expect data to be predictable and wonder what’s being hidden if it is presented in an unexpected manner. Predictable routines are more effective for data.  The regularity in data makes it seem more trustworthy and reliable.  People expect data to offer accuracy, not broader meaning. 

While content and data are different, there’s still a big push to treat content as if it were data.  We’ve seen how earlier attempts to do this, such as mash-ups, were widely unsuccessful because they were incoherent and fragmented the experience for audiences.  More recently, data enthusiasts have promoted the application of semantics to eliminate differences in how publishers describe and use their content.  It’s important to address the potential and limitations of semantics. 

Digital resources rely on semantics. But in the view of some, these resources don’t rely on semantics enough.  Some argue that all digital resources can and should be described with a common model: the RDF data model.   

The RDF data model represents the vision of what the inventor of the World Wide Web, Sir Tim Berners-Lee, thought the web should become: a connected body of commonly described data, or linked data. I’ve seen efforts to use the RDF data model to publish content rather than data.   But I’ve never seen an RDF data model that reflects editorial priorities, structuring content the way audiences expect it.  RDF works well for data but when used for content, it delivers dull, fact-based pages that at best resemble an encyclopedia. An interesting art museum collection gets transformed into a lifeless database dangling with confusing categories and options. These efforts reflect a belief that details are more important than narratives.   

Wikipedia, the most popular encyclopedia on the web, starts with content, rather than data. Wikipedia offers templates to provide an editorial framework for the content, which enables structural uniformity but doesn’t impose it.  Wikipedia is one of the most advanced experiments attempting to harmonize content with a data model, but it highlights the limitations that even a single publisher can have doing that.  It’s possible to extract data from Wikipedia but the complexity of its content has defied efforts to normalize it into a predictable data model.  

How resources are consumed and used

The flip side of expression is interpretation.  The meaning of a resource is not only a function of how it’s represented — the semantics.  Its meaning depends on how it’s received and evaluated — what humans and machines notice and understand.  

 The possibilities for machine-assisted interpretation are growing as data becomes richer, especially when it is highly curated.  People consume data — the fields of data visualization and data storytelling, for example, have exploded in the past decade, and these human-centered experiences can be machine-generated.  And we can no longer assume that stuff traditionally considered content — narrative text, for example — is never consumed by machines.  The range of what machines can do keeps growing, but they remain fundamentally different from humans in what they aim to accomplish.  

Who’s the resource for?

Let’s look at who needs or will want to use a resource. Does everyone need it, or only some people?  Many debates about the importance of standards used to describe resources arise because of different presumptions about who needs them.  Often the audience for the resource is never clearly defined.

Those who advocate for standards believe that everyone needs a shared basis of understanding: a common schema. A related belief is that everyone needs to know the same kinds of things from the resource, so using one common standard will be adequate for everyone.  Those who don’t embrace common standards consider them as extra work, often getting in the way of what’s needed to provide a complete explanation.    

If everyone agrees about the semantics — what to describe and what that signifies —  it can amplify the understanding and the potential utility of resources.  But reaching a universal agreement is difficult to achieve because people in practice want different things from resources.  They have different goals, preferences, priorities, and beliefs about the utility of various resources. What gets represented in standards is a reflection of what the standards makers want to be consumed.  Some people’s priorities don’t fit within what the standards committee is interested in supporting.  And when resources rely on semantic conventions outside of the agreed mainstream, it can potentially limit how those resources are acknowledged by IT systems that rely on these mainstream standards.  Standards are a closed system: they encourage connections within the universe of standards-adopters while indifferently ignoring outsiders.  Common standards can promote shared understanding. But they can also throttle expression by limiting the scope of what can be said.  

Data needs a schema to indicate how different bits of information are related to one another — otherwise, the data is not intelligible. Many data schemas are self-defined to reflect the requirements of the publisher.  Some data only moves around within a single publisher.  But because data is commonly exchanged between different parties, data will often utilize a common schema of some sort.  Externally-defined standards support exchanging resources with other parties. But it’s a mistake to see them as frictionless and cost-free.  

Digital resources that adhere to “open standards” present themselves as universally usable and available.  But in practice, they can become a walled garden that’s not accessible to all, due to stealth technical hurdles or barriers.  To understand this apparent paradox, one first needs to step back and ask how inclusive the standard’s development and adoption are. Standards are created by committees of often like-minded people who have specific agendas about what they want the standard to do.  If a narrow group was responsible for developing a standard, it may look like a consensus but it may never have achieved broad interest and wide adoption.  This can happen when standards are opinionated — they seem to work great, as long as one’s willing to sign on to the concepts, presumptions, and limitations of the standard.  And every standard has these constraints, though few are keen to advertise them.  Practices that are looser in their demands tend to be more flexible (and practical) and hence more widely adopted, and in the process become de-facto standards that are more impactful than many official ones.  

Data relies on APIs to broker what’s needed by different parties.  For people using apps that must access data from various sources, machines need to exchange data and APIs indicate how to do that.  Frequently, only select data needs to be available to a limited number of other parties. In other words, not everyone needs everything. A lot of data is personal to an individual: your streaming music playlist, your car service history, or your vacation plans.  While some people want to broadcast their personal data, many people are reluctant to do that, concerned about their personal privacy and cybersecurity.  

Most data is exchanged using custom APIs, where the provider of data describes its schema. Most data does not conform to a universal standard such as the RDF data model, and only a minority of IT systems support RDF.  

Yet some data is considered public and is made universally available.  Data described with RDF can connect easily to other data that are  described with that model.  RDF is most commonly used for non-proprietary data since there’s an incentive to connect such data widely.  

The audience for RDF-defined data is unique in several respects.  RDF schemas tend to be designed by technical people for other technical people: engineers, scientists, or data analysts. The people creating the schema are largely the same people using the data described by the schema. They share a mental model of what’s important. The data that’s targeted is presumed to be a public good: it should be available to everyone, who can access it by using a commonly adopted schema.  Semantic standards assume that data must have a universal identifier because anyone might need to use the data in its raw form.  This sentiment is most expressed in the notion of “open data”: that data must be downloadable and reusable by anyone — you agree to surrender rights to how what you create is used.  The “authorship” of data, unlike content, is often anonymous.  Data is easily replicated, becoming widely available.  It can quickly become public domain, where it’s not unique enough to merit copyright.

Unlike data, content is neither private nor public.  It’s meant to be widely used but normally it’s neither individually-focused nor universally needed.  Not everyone will want or potentially need the same content.  They may want the content from specific sources since the source of the content is part of the context audiences consider when evaluating the content. Content is generally proprietary: it isn’t meant to mix freely with content from other sources.  Content normally has an identifiable individual or institutional author, unlike the case with much data.  To have value, content should be unique, and by extension be subject to copyright.   

For a scientist, any data is potentially useful because you never know ahead of time what facts might be connected to others.  For a teenager viewing a smartphone screen, only some content will ever be of interest.  Content, more than data, is situationally relevant; the requirement of having a universal identifier is not as great.  Content needs an identifier of some sort, but since not everyone in the world needs to be able to access all content, every piece of content doesn’t need a globally unique ID.  Most people only need a way to access the assembled content, such as a web page.  They don’t need access to the individual elements that comprise that web page. That task is delegated to an API, which worries about interpreting what’s wanted with what’s available.

Government-funded organizations often have an obligation to make everything they publish be described using an open standard. Public funding of data tends to drive data openness.  But the data itself may only be of interest to a small group of people.  Just because data utilizes common standards doesn’t imply there’s a universal demand for the data.  Data about the DNA of fruit flies or archeological artifacts from Crete may be of public interest, but they aren’t necessarily of universal interest.

What’s available to use? 

How resources can be used to some extent depends on the relationship between the parts and the whole.  

  • Do the parts have meaning when separated from the whole?  
  • Do different parts when combined resemble a coherent whole?

We can think about digital resources as fulfilling two different scenarios.  First, we have cases where the user is expecting something specific from the resource: the known-knowns.  They have a question to get answered, or a predefined experience they seek, such as listening to a specific music track.  Second, we have cases where the user doesn’t have strong expectations: either known-unknowns or even unknown-knowns.  

The issue is more complex than it first appears because we haven’t fully defined who the user is.  Does the user directly make their choices on their own, or do they rely on an intermediary (a machine or editor) to guide their choice?

What’s the resource for?

Both data and content can answer questions, but the kinds of questions they can answer are different.  Questions vary:

  • What does the user want to know?  
  • Do they have a specific question in mind?  
  • Does the answer provide a complete picture of what they need to know, or does it hide some important qualification or comparative? 

This distinction between data and content has practical consequences for the design of APIs that offer answers. What can be queried?  Does the value need a label to explain what it is?  Is the author of an API query a curator of content, or the seeker of knowledge?  A curator intermediator will be inclined to read the API documentation, while the knowledge seeker wants direct answers.

Both people and machines consume resources — they try to make sense of them and act on what they say.  When resources are consumed, the identity of what’s being discussed in the resource can be a vexing issue.  What is it talking about, and is it the right thing of interest?  Data IDs can help us be more precise in specifying and locating items, but they can also be misleading.  Both machines and humans presume that identical strings of characters indicate equivalent items.  In the case of humans, this is most true when the string is uncommon and assumed to refer to something unique.  Machines are normally less discriminating about how common something is: they are greedy in making inferences unless programmed not to.  Machines assume every matching set of strings refers to the same thing.  

Of course, just because different items have the same string label or ID, that doesn’t mean they are referring to the same idea.  The hashtags in folksonomies illustrate this problem.  Catchy words or phrases can become popular but have diverging meanings.  
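A minimal sketch of the problem, assuming a handful of made-up posts: a program that groups items by an identical tag string treats them as one topic, even though their authors meant different things. The posts and the `intended_topic` field are hypothetical and exist only to make the divergence visible.

```python
from collections import defaultdict

# Hypothetical posts. "intended_topic" records what each author meant,
# which a string-matching machine never gets to see.
posts = [
    {"tag": "#jaguar", "intended_topic": "animal"},
    {"tag": "#jaguar", "intended_topic": "car"},
    {"tag": "#python", "intended_topic": "programming language"},
    {"tag": "#python", "intended_topic": "snake"},
    {"tag": "#onions", "intended_topic": "vegetable"},
]


def group_by_tag(items):
    """Group posts the way a naive machine would: identical string, same thing."""
    groups = defaultdict(list)
    for item in items:
        groups[item["tag"]].append(item)
    return dict(groups)


def ambiguous_tags(items):
    """Return the tags whose posts were written with more than one meaning in mind."""
    result = {}
    for tag, members in group_by_tag(items).items():
        meanings = {member["intended_topic"] for member in members}
        if len(meanings) > 1:
            result[tag] = sorted(meanings)
    return result


print(ambiguous_tags(posts))
```

The grouping step is all a machine has by default. Detecting the ambiguity here depends on extra information that real tags rarely carry.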

Humans have trouble when talking about concepts, because the same word or phrase may imply different things to different people, what’s known as polysemy.  Nearly everyone has their own idea about what love means, even while nearly everyone is happy to use this four-character string.  Identity (labels or IDs) is not the same as semantics (shared meaning).  In natural language conversation, ambiguity is a recognized problem and to some extent expected, with clarifying questions a common countermeasure.  Suppose a speaker is talking about “personalization.”  The listener may wonder: what does he mean?  The listener may supply her own definition, which may be the same as the speaker’s, or slightly different — or vastly different.  The listener may use a different name to refer to what the speaker considers “personalization” — to the listener, the speaker is talking about “customization.”  It’s also possible that the speaker and listener agree about the definition of personalization, but still conceptualize it differently: one sees it as a specific kind of algorithm while the other considers it a specific kind of online behavior.  Even shared definitions don’t imply shared mental models, which involve assumptions outside of simple definitions.  

 Concepts, whether familiar or technical, often lack precise boundaries or universal definitions.  Humans often encounter others using the same terminology to refer to a concept, only to find that others don’t embrace the same meaning or perspective about it.  

Much of what passes for semantics in the computer realm is more about agreed identifiers than agreed meaning.  A “thing” that can be precisely identified can have multiple identities — depending on who is perceiving.  Machines are unable to distinguish the role of denotation from connotation.  Consider an emoji of a facial expression, which has a specific Unicode ID.  The Unicode ID denotes a visual face of some sort.  But the meaning of that facial expression emoji is subject to interpretation.  A single Unicode character can generate polysemy — multiple interpretations.  A Globally Unique ID can spark false confidence that everyone will understand the entity in the same way.
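The emoji case can be sketched with Python's standard `unicodedata` module. The code point and its official character name are what the identifier denotes. The list of readings is a hypothetical illustration of what people take it to mean, not data from any survey.

```python
import unicodedata

emoji = "\U0001F64F"  # one Unicode character

# Denotation: the identifier and the official name the standard assigns to it.
code_point = f"U+{ord(emoji):04X}"
official_name = unicodedata.name(emoji)

# Connotation: hypothetical examples of how different people read the same character.
readings = ["prayer", "thank you", "please", "high five"]

print(code_point, official_name)
print(f"{len(readings)} readings for one identifier: {readings}")
```

The identifier is stable and unique. Nothing in it settles which reading a given sender intended.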

The promise and peril of IDs are that they act as stand-ins for a bunch of statements.  In natural language, we’d call them loaded phrases: they trigger all kinds of associations. Are the terminology or labels used for a concept changing, or is the identity of the concept changing?   But maybe the ID isn’t precise?  We tend to describe things, and how we describe things tends to define them.  Maybe the descriptive properties aren’t relevant to what’s being said.  Maybe they aren’t useful at all.  

 Shades of truth

Data are about indirect experience.  They provide detached information that’s been recorded by someone or something else.  Data are generally treated as “facts” — they’re considered objective.  Content doesn’t represent itself so absolutely.  It speaks about the author or reader’s direct experience — personally acquired or understood information.  Content is about expression, which involves individual interpretation.  Notionally, data is free of interpretation, while content supplies it.  

As mentioned earlier, data is exchanged between systems and so needs to be defined with enough precision to allow that to happen.  The value of data (in theory at least) is independent of its source.  Its utility and accuracy are presumed so where the data comes from is less a concern.  Do we care who publishes a list of US presidents or where Google gets its answers presented in its knowledge panel box?  If we believe data is intrinsically objective — that data isn’t false or misleading — we don’t.  

Content — subjective and embodying an editorial point of view — is a reflection of its publishing source.  Audiences evaluate content partly by the reputation of the source. They evaluate content not just for its factual accuracy but its completeness, fairness, insight, transparency, and other qualities.   People value content for being unique.  Content can’t always be easily measured or compared directly against other content.  People rely on their judgment rather than some calculated comparison.  

The presumption of knowledge graphs is that KNOWLEDGE (yes it’s a big shouty idea) can be reduced to data.  A less glamorous name for knowledge graphs is linked data, the term that Tim Berners-Lee championed two decades ago. The terms knowledge graphs and linked data are nearly synonymous, the main difference being that knowledge graphs involve a curated set of data (curation = editorial choices).  Because knowledge graphs are promoted as the answer to nearly every problem and possibility facing humankind, it’s important to be clear what knowledge graphs can and can’t do.  Knowledge graphs advance our ability to connect different facts and bring transparency to many domains. They are especially useful in supporting internally-focused enterprise use cases where experts share data with their colleagues, who can interpret its meaning and significance based on a shared understanding. Knowledge graphs are indeed useful in many scenarios.  But we must accept that the “knowledge” aspect of the term is a bit of marketing hyperbole, popularized by Google.  Data, by itself, no matter how elaborately connected and explicated, does not generate knowledge, because knowledge requires explanation, while data can only show self-evident things.  Content, in contrast, is about presenting information that’s not self-evident to audiences.  Content explains facts.  Knowledge graphs link facts.  

Many possibilities are available to connect items of data together especially when joining tables or federating search across different sources.  But the output of data queries is still limited in what it can express.  Data are related through operators, such as equals (is), comparisons such as more than or before, and inclusion (has).  That’s a small set of expressions compared to natural language.  Semantic data is more expressive than ordinary relational data because its properties function like verbs in natural language.  But the constraint remains. For data to be manageable, the properties must be enumerated into a limited list of verbs.  Nuance is not a forte of data.  
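A rough sketch of that constraint, assuming a toy store of subject, property, object statements. This is plain Python rather than RDF or any real graph library, and the gadget names and the four verbs are hypothetical. Statements can only be made with verbs from the enumerated list, so a nuanced remark has nowhere to go.

```python
from enum import Enum


class Predicate(Enum):
    """The enumerated 'verbs' this toy model allows."""
    IS = "is"
    HAS = "has"
    BEFORE = "before"
    MORE_THAN = "more than"


# Each statement is a (subject, predicate, object) triple.
triples = [
    ("gadget-a", Predicate.IS, "camera"),
    ("gadget-a", Predicate.HAS, "zoom lens"),
    ("gadget-b", Predicate.IS, "camera"),
    ("gadget-a", Predicate.MORE_THAN, "gadget-b"),  # e.g. a price comparison
]


def match(subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def add_statement(subject, verb, obj):
    """Add a statement only if its verb is in the enumerated list."""
    predicate = Predicate(verb)  # raises ValueError for an unlisted verb
    triples.append((subject, predicate, obj))


cameras = [s for (s, _, _) in match(predicate=Predicate.IS, obj="camera")]
print(cameras)

try:
    add_statement("gadget-a", "is surprisingly pleasant to hold", "")
except ValueError as err:
    print("cannot express:", err)
```

The query side stays simple because the verb list is closed. The same closure is what keeps the last statement out.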

Data’s relevance challenge

How does data establish its relevance to audiences?  Unlike content,  data was never created with a specific audience need in mind.  So how can it become relevant to people?  

Content APIs are designed around basic questions: what content will be relevant to audiences?  The API delivers specific content that provides the answer.  The body of content will be relevant to different audiences for different reasons, though no single individual will necessarily be interested in all of it. The content was created to answer specific questions.  The API’s task is to determine which slice within the content body is needed for a specific context at a specific time.  Answer-responses can be programmatically delivered using a flexible range of simple declarative questions.

Data, being more open-ended in how it can be queried, has more difficulty providing relevance reliably.  Data can answer straightforward questions easily: what’s the cheapest gadget that’s in stock, or the highest-rated gadget?  While the result doesn’t provide much explanation, the answers can point audiences to content items they may want to view to understand more.

Knowledge graphs are supposed to turn general-purpose data into something that will be understandable and relevant to the casual inquirer. They’re able to do that when the user understands the domain already, but even then the relevance of results is uncertain.  Using knowledge graphs to extract valuable insights from data can be a deeply labor-intensive exercise  — a treasure hunt in an age when consumers expect systems will automatically tell them what they need to know.

Knowledge graphs are generally built from graph databases, which connect different types of entities and reveal their indirect relationships.  These databases have been difficult to translate into audience-facing applications.  People don’t know what to ask when dealing with indirect relationships: what relationships are potentially valuable to explore.  For this reason, we see few consumer applications using SPARQL, the most commonly used specialized query language for knowledge graphs.  

Instead, most consumer recommendation engines operate using a different sort of graph database called property graphs that are based on self-defined semantics rather than universally agreed ones.  Property graphs are optimized to find the shortest path or distance between two objects.    The goal of the recommendation is to highlight the strength of the connection rather than providing a way for the customer to drill into the multifaceted relationships.  Recommendation engines are fundamentally different in purpose from faceted filtering.  They remove complexity from the user, rather than expose users to it.
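To make the shortest-path idea concrete, here is a minimal sketch using breadth-first search over a small, hypothetical purchase graph. It assumes unweighted edges and made-up customers and products. Real recommendation engines use richer scoring, so treat this as an illustration of the principle only.

```python
from collections import deque

# Hypothetical graph: customers are linked to the products they bought.
# The labels are self-defined, not drawn from any shared vocabulary.
edges = {
    "alice": ["tent", "stove"],
    "bob": ["tent", "lantern"],
    "carol": ["lantern", "kayak"],
    "tent": ["alice", "bob"],
    "stove": ["alice"],
    "lantern": ["bob", "carol"],
    "kayak": ["carol"],
}
products = {"tent", "stove", "lantern", "kayak"}


def distances_from(start):
    """Breadth-first search: number of hops from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def recommend(customer):
    """Return the closest product the customer does not already own, or None."""
    owned = set(edges[customer])
    dist = distances_from(customer)
    candidates = [
        (hops, product)
        for product, hops in dist.items()
        if product in products and product not in owned
    ]
    return min(candidates)[1] if candidates else None


print(recommend("alice"))
```

The customer sees a single suggestion. The path that produced it, through another customer's purchases, stays hidden.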

The goal of using semantic data to “reason” — to draw non-explicit conclusions — hasn’t materialized in mainstream consumer applications.  Most semantic data is used to answer explicit questions, albeit sometimes highly complex ones.  One benefit of semantic data over traditional data is that the questions asked can be more elaborate because each item of data ultimately is relatable to other data.  The cost of this benefit is high: the answer to any question may be null — because it’s not obvious what questions yield useful answers.    The general public has limited interest in chaining together elaborate queries. They tend to ask straightforward questions.  

When questions become complex and topics are opaquely enmeshed, systems can’t expect audiences to articulate the question: they need to anticipate what’s needed.  While experts are interested in how an answer was arrived at, ordinary users on the web are more interested in getting the answer.  They want to be able to rely on curated queries that ask interesting questions that have meaningful answers.  

Richness and ambiguity in content and data

Whether content or data provides richer expression depends on many factors.  Both can be vague.  Using many words won’t necessarily make things clear.  And presenting many facts doesn’t tell you everything you need to know.

One measure of the value of a resource is its succinctness. A succinct resource can convey much.  Alternatively, it may flatten out important nuances.  Data’s value is associated with its  facticity, where facts speak for themselves and no one entertains deviant ideas about what those facts mean.  Content’s value isn’t about crystalized facts or named entities.  While content is often reduced to being about words, it’s actually about wording — interpretation, understanding, and feeling.  

Data provides confident answers to narrowly defined questions.  Content provides richer but more uncertain answers to broadly-defined ones.  

The limits of decomposition

People learn a lot by focusing on the core facts — nuggets that can be translated into data — within statements in content.  But that focus also carries a risk: the attractive but mistaken idea that all content can be boiled down to data.  People often want direct answers, which we expect can be filtered from facts.  But answers also often involve interpretations, which go beyond agreed facts. Our interpretations can diverge, even if  everyone agrees with what the facts are.

Consider a common, simple question: who is a movie for?  The content of a film can be reduced to a maturity rating. In principle, these are based on clear criteria, but in practice, they can be difficult to apply unambiguously and consistently.  Classification is often based on patterns rather than criteria satisfaction.  And the maturity rating still doesn’t tell us much about who the movie is for, even if we agree the rating is accurate.  

According to the ideas of semantic data, almost anything can be defined in terms of its properties.  An entity should be explainable through its data values.  In practice, such descriptions only work for simple entities with regularized parameters — generally human-made things or abstract archetypes.  Complex things that have evolved organically over time, whether products, living beings, or ideas, can’t be easily reduced to a handful of data parameters.  They defy simple data-explication, because of:

  • Tacit qualities that can’t be articulated easily  
  • Complex attribute interactions, including irregular combinations and exceptions
  • The overloading of dimensions, where there are too many dimensions to track easily

The mundane act of identifying a bird species illustrates the issue. For many related species, it can be hard to do, as each bird species can have many attributes that vary, and many bird species have similar attributes that generate confusion in identification.  Birders rely on evaluating the “jizz” of birds. Many categories we rely on are concepts that have general tendencies rather than absolute boundaries.  For this reason, criteria-based definitions aren’t satisfactory.  

Data is also not good at representing many processes and how they can change the state of things.  Consider the many ways an onion can be processed, resulting in different outcomes with different property values.  The cut of onions (diagonal, minced) and cooking method (steamed, sautéed, stir-fried, deep fried) influence texture, sweetness, and so on.  The input properties influence the contours of the output properties but don’t dictate them.  There are too many subtle variables to model accurately in a data model.  Content is more efficient at discussing these nuances.

Food is born from recipes: a structure of inputs yielding an output.  It’s become a stock metaphor for illustrating how content or data works.  Those who believe “content is data” love to cite the example of food recipes.  Recipes are structured, they follow a standard convention, and deal with entities and quantities.  But contrary to what most people think, recipes aren’t really data.  They are content.  They are full of nuance.

A recent semantic data research paper explored how knowledge graphs could allow substitutions in recipes.  The authors, at IBM and the Rensselaer Polytechnic Institute, explored the potential of a “knowledge graph of food.”  They had the goal of creating a database that could allow people to switch out a recipe ingredient to use something more healthful. The authors figured ingredients can be quantified according to their nutritional values: vitamins, calories, fat content, etc.  “We use linked semantic information about ingredients to develop a substitutability heuristic for automatically ranking plausible ingredient substitutions.” While the goal is laudable, the project in many ways was naive.  The authors hoped that a simple substitution would be possible with no second-order effects.  But ingredients interact in complex ways, much like the details of content do. If you want to replace cream in a recipe, there are many options with different tradeoffs. Suppose we think about recipes as formulas of ingredients.  When one ingredient changes, it changes the formula and the outcome.  The problem is that food isn’t only about chemical properties.  It’s about experiential properties.  People react to food according to its sensory qualities: taste foremost, but also texture, aroma, and appearance.  The authors acknowledge that “the quality of an ingredient substitution can be subjective and difficult to concretely determine.”  But they don’t dwell on that because it’s a problem they can’t solve.  Any cookbook writer will tell readers that substitutions are possible, but they will change the character of the dish and may necessitate other ingredient changes and compensating measures.  

When resources discuss things that are important priorities to people, they will address numerous factors they care about.   The whole will be more complex than the sum of its parts.  

Content and Data as Equals

People rely on both content and data.  It’s a mistake to view one as superior to the other.  

From the perspective of user experience, people relate to events in the world on three levels:

  1. Phenomena that we perceive or notice: the content of our experience
  2. How we identify, classify, and characterize those phenomena: the facts or data we ascribe to what we’ve encountered
  3. How we explain and evaluate the phenomena: the content of our personal explanations

Our perceptions translate into properties of an entity.   Our characterizations translate into its values.  Our explanations are complex and open-ended, like content.

Data can’t replace content.  A knowledge graph query will tell you very little about the state of knowledge graph research.  You’ll need to read PDFs of conference papers to learn that.

And modern digital content can’t be viable without drawing on data to explain entities in concrete detail.  Data can highlight what’s available within the content and clarify its details.  Data enhances content.

Standards are valuable when many different people need to do the exact same thing. A basic philosophical disagreement concerns how similar or dissimilar people’s needs are.  On one extreme is the tendency to consider every resource as unique, with no reusability. That’s the old model of web content, where data was a second-class citizen.  On the other extreme is the belief that everyone needs the exact same resources and that these can be described with standardized structure.  People cease being unique.  Content is a second-class citizen.

Content needs to move forward to develop internal publishing standards for structure: internal schemas that allow coherent assembly of different parts into meaningful wholes. That improves the range of combinations that content can present to be able to address the needs of different individuals. But the standardization of content structure can only go so far before it gets homogenized and loses its interest and value to audiences.  It’s not realistic to expect content publishers to adopt external schemas for content: they will lose their editorial voice if they do so.  

For data to increase its utility, it will need to start considering its editorial dimensions, especially who the data will be for and what those people need. As public wariness toward data collection by big institutions builds, ordinary people need to feel data can serve their needs and not just the needs of experts or powerful corporations.  Data curators are needed to look at how data can empower audiences.

Data and content both have structures, which allow them to be managed in similar technical ways. Both content and data can be queried with APIs, and APIs can combine both content and data from different sources.  But despite that commonality, they have different purposes.  The goal should be to make them work together where it makes sense to, but not expect one to be hostage to the other.

The technical possibilities for transforming content and data have never been greater.  It can be easy to become dazzled by these possibilities and idealize how beneficial they will be for people.  Too little of the discussion so far has focused on what ordinary people need. Experts need to engage more with non-experts to learn what makes digital resources truly meaningful.  

— Michael Andrews

Where does Content Structuring Happen?

Many useful approaches are available to support the structuring of content.  It’s important to understand their differences, and how they complement each other.  I want to consider content structuring in terms of a spectrum of approaches that address different priorities. 

Only a few years ago most discussion about structuring content focused on the desirability of doing it.  Now we are seeing more written about how to do it, spawning discussion around topics such as content models, design patterns, templates, message hierarchies, vocabulary lists, and other approaches.  All these approaches contribute to structuring content.  Structuring content involves a combination of human decisions, design methods, and automated systems.

Lately I’ve been thinking about how to unlock the editorial benefits of content models, which are generally considered a technical topic. I realized that discussing this angle could be challenging because I would need to separate general ideas about structuring content from concepts that are specific to content models.  While content models provide content with structure, so do other activities and artifacts.  If the goal is to structure content, what’s unique about content models?   We need to unpack the concept of structuring content to clarify what it means in practice.

Yet structuring content is not the true end goal.  Structuring content is simply a means to an end.  It’s the benefits of structuring content that are the real goal.  The expected benefits include improved consistency, flexibility, scalability, and efficiency. Ultimately, the goal should be to deliver more unique and distinctive content: tailored and targeted to user needs, and reflecting the brand’s strategic priorities.

In reality, content structuring is not a thing.  It’s an umbrella term that covers a range of things that promote benefits, such as content consistency and so on.  Many approaches contribute.  But no single approach can claim “mission accomplished.”

What are we talking about, exactly?

People with different roles and responsibilities talk about structuring content.  It sometimes seems like they are talking about different surface features of a giant pachyderm.

The Guardian newspaper earlier this month published an article on how they use “a content model to create structured content”.  

“Structured content works well for a media company like The Guardian, but the same approach can work for any organization that publishes large amounts of data on multiple platforms. Whether you’re a government department, art gallery, retailer or university.”

The Guardian

I applaud these sentiments, and endorse them enthusiastically.  Helpfully, the article provided tangible examples of how structuring content can make publishing content easier.  But the article unintentionally highlighted how terminology on this topic can be used in different ways.  The article mentions content reuse as a benefit of content structuring.  But the examples related more to republishing finished articles with slight modification, rather than reusing discrete components of content to build new content. When the writer, a solutions architect, refers to a content type, he identifies video as an example.  Most content strategists would consider video a content format, not a content type.  Similarly, when the article illustrates the Guardian’s content model, it looks very limited in its focus (a generic article) — much more like a content type than a full content model.  

Mike Atherton commented on Twitter that the article, like many discussions of content structuring, didn’t address distinctions between “presentation structure vs semantic structure, how the two are compatible or, indeed, different, and whether they can or should be captured in the same model.”

Mike raises a fair point: we often talk about different aspects of structure, without being explicit about what aspect is being addressed.

I think about structure as a spectrum. As yet there’s no Good Housekeeping Seal Of Approval on the one right way to structure content.  Even people who are united in enthusiasm for content structure can diverge in how they discuss it — as the Guardian article shows.  I know other people use different terminology, define the same terminology in different ways, and follow slightly different processes.  That doesn’t imply others are wrong.  It merely suggests that practices are still far from settled.  How an organization uses content structuring will partly depend on the kind of content they publish, and their specific goals.  The Guardian’s approach makes sense for their needs, but may not serve the needs of other publishers.  

For me, it helps to keep the focus on the value each distinct kind of decision offers.  For those who write simple articles, or write copy for small apps that don’t need to be coordinated with other content, some of these distinctions won’t be as important.  Structure becomes increasingly important for enterprises trying to coordinate different web-related tasks.  The essence of structure is repeatability.

The Spectrum of content structuring

The structuring of content needs to support different decisions.   

Structure brings greater precision to content. It can influence five dimensions:

  1. How content is presented
  2. What content is presented
  3. Where content is presented
  4. What content is required
  5. What content is available

Some of these issues involve audience-facing aspects, and others involve aspects handled by backend systems.  

Different aspects of content structure

Content doesn’t acquire its structure at one position along this spectrum.  The structuring of content happens in many places.  Each decision on the spectrum has a specific activity or artifact associated with it. The issues addressed by each decision can be inter-related.  But they shouldn’t become entangled, where it is difficult to understand how each influences another.  

UI Design or Interaction Design

UI design is not just visual styling.  Interaction design shapes the experience by structuring micro-tasks and the staging of information. From a content perspective, it’s not so much about surface behaviors such as animated transitions, but how to break up the presentation of content into meaningful moments.  For example, progressive disclosure, which can be done using CSS, both paces the delivery of content and directs attention to specific elements within the content.  Increasingly, UX writers are designing content within the context of a UI design or prototype.  They need to understand the cross-dependencies between the behavior of content and how it is understood and perceived.

The design of behavior involves the creation of structure.  Content needs to behave predictably to be understandable.  UI design leverages structure by utilizing design patterns and design libraries.    

Content Design

Content design encompasses the creation and arrangement of different long and short messages into meaningful experiences. It defines what is said.  

Content design is not just about styling words.  It involves all textual and visual elements that influence the understanding and perception of messages, including the interaction between different messages over time and in different scenarios.  Words are central to content design; some professionals involved with content design refer to themselves as UX writers. Terminology is finely tuned and controlled to be consistent, clear, and on-brand. 

Writers commonly break content into blocks of text.  They may use a simple tool like Dropbox Paper to provide a “distraction free” view of different text elements that’s unencumbered by the visual design.  It may look a bit like a template (and is sometimes referred to as one), but its purpose is to help writers plan their text, rather than to define how the text is managed.  The design of content relies heavily on the application of implicit structure.  Audiences understand better when they are comfortable knowing what they can expect.  The design may utilize a message hierarchy (identifying major and minor messages), or voice and tone guidelines that depend on the scenario in which the writing appears.  For the most part these implicit structures are managed offline through guidelines for writers, rather than through explicit formal online systems.  But some writers are looking to operationalize these guidelines into more formal design systems that are easier and more reliable to use.

Content design involves delivering a mix of the fresh and the familiar.  The content that’s fresh, that talks about novel issues or delivers unique or distinctive messages, is unstructured: it doesn’t rely on pre-existing elements.   Messages that are familiar (recycled in some way) have the possibility of becoming structured elements.  Content design thus involves both the creation of elements that will be reused (such as feedback messaging), and ad hoc content that will be specific to a given screen.  But even ad hoc elements present the opportunity to reuse certain phrases and terminology so that the copy is consistent with the content’s tone of voice guidelines.   Some publishers are even managing strings of phrases to reuse across different content.
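One way to picture the managing of reusable phrases is a small string catalog. The sketch below is only an illustration: the catalog keys, the messages, and the `render_message` helper are hypothetical names, not taken from any particular publisher or tool.

```python
# Hypothetical catalog of reusable phrases, keyed by purpose.
MESSAGE_CATALOG = {
    "feedback.saved": "Your changes have been saved.",
    "feedback.error": "Something went wrong. Please try again.",
    "cta.renew": "Renew your {plan} subscription",
}


def render_message(key: str, **values: str) -> str:
    """Look up a reusable phrase and fill in any placeholders."""
    template = MESSAGE_CATALOG[key]
    return template.format(**values)


# Copy for one screen can mix fresh text with controlled, reused phrases.
screen_copy = [
    "Your plan ends on Friday.",                   # fresh, written for this screen
    render_message("cta.renew", plan="Premium"),   # familiar, reused
]
```

The fresh sentence stays ad hoc, while the call-to-action comes from the shared catalog, so its wording only has to be revised in one place.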

Page Templates

Templates provide organizational structure for the content, for example by prioritizing the order of content and creating a hierarchy between primary and secondary content.  The template defines the elements that will be consistent for any content using the template, in contrast to the interaction design, which defines the elements that will be fluid and will change and respond to users as they consume the content.

Templates provide slots to fill with content. Page templates specify HTML structure, in contrast to the drafting templates writers use to design specific content elements.   Page templates express organizational structure, such as where an image should be placed, or where a heading is needed. The template doesn’t indicate what each heading says, which will vary according to the specifics of the content.  Templates can sometimes incorporate fixed text elements, such as a copyright notice in the footer of the page, if they are specific to that page and are unlikely to change.  The critical role that templates play is that they define what’s fixed about a page that the audience will see.  Templates provide the framework for the layout of the content, allowing other aspects of the content to adjust.
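To make the idea of slots concrete, here is a minimal sketch using Python’s standard `string.Template`. The markup, the slot names, and the `render_page` helper are all assumptions made for illustration; a real page template would also need to escape the values it inserts.

```python
from string import Template

# Hypothetical page template: the HTML structure and the footer are fixed,
# while the $-prefixed names are slots filled in for each piece of content.
PAGE_TEMPLATE = Template(
    "<article>\n"
    "  <h1>$heading</h1>\n"
    "  <img src=\"$image_url\" alt=\"$image_alt\">\n"
    "  <div class=\"body\">$body</div>\n"
    "  <footer>Copyright Example Publisher</footer>\n"
    "</article>"
)


def render_page(heading: str, image_url: str, image_alt: str, body: str) -> str:
    """Fill every slot; substitute() raises KeyError if a slot is left empty."""
    return PAGE_TEMPLATE.substitute(
        heading=heading,
        image_url=image_url,
        image_alt=image_alt,
        body=body,
    )


page = render_page(
    heading="Choosing a kettle",
    image_url="/images/kettle.jpg",
    image_alt="A stainless steel kettle",
    body="What to look for before you buy.",
)
```

The template says where the heading and image go, not what they say. The footer text is the kind of fixed element a template can carry itself.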

Layout has a subtle effect on how content is delivered and is accessed across different screens.  Elements that are obvious on some screen sizes may not be so on other screen sizes — for example, a list of related articles, or a cross-promotion.  Page templates must address how to make core information consistently available.  

Content Types

Content types indicate what kinds of information and messages audiences need to see to satisfy their goals.  The more specific the audience goal, the more specific the content type is likely to be. For example, many websites have an “article” content type that has only a few basic attributes, such as title, author and body.  Such types aren’t associated with any specific goal.  But a product profile on an e-commerce website will be much more specific, since different elements are important to satisfying user needs for them to decide to buy the product.  The more specific a content type, the more similar each screen of content based on it will seem, even though the specific messages and information will vary. Content types provide consistency in the kinds of information presented for a given scenario.

A content type is designed for a specific audience who has a specific goal. It specifies: to support this purpose, this information must be presented.  It answers: what elements of content need to be delivered here for this scenario?  One of the benefits of a content type is that it can provide options to show more details, fewer details, or different details, according to the audience and scenario.

Content types also encode business rules about the display of content. In doing so, they provide the logical structure of content.   If the content model already has defined the specifics of required information, it can pre-populate the information — enabling the reuse of content elements.  
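As a rough sketch of what a content type specifies, the example below models a product profile with required and optional elements. The `ContentType` class, its field names, and the “summary” and “full” detail levels are hypothetical, chosen only to illustrate the idea of required information and of showing more or fewer details.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContentType:
    """A hypothetical content type: what must, and what may, be shown."""
    name: str
    required: List[str]
    optional: List[str] = field(default_factory=list)

    def missing(self, item: Dict[str, str]) -> List[str]:
        """Required elements the item has not supplied."""
        return [name for name in self.required if not item.get(name)]

    def select(self, item: Dict[str, str], detail: str = "full") -> Dict[str, str]:
        """Pick elements to present; 'summary' shows required elements only."""
        names = list(self.required)
        if detail == "full":
            names += self.optional
        return {name: item[name] for name in names if name in item}


product_profile = ContentType(
    name="product_profile",
    required=["name", "price", "description"],
    optional=["specifications", "reviews"],
)

kettle = {
    "name": "Kettle",
    "price": "30 USD",
    "description": "Boils a litre of water in about three minutes.",
    "reviews": "Rated 4.5 out of 5",
}

gaps = product_profile.missing(kettle)               # required elements not yet supplied
summary = product_profile.select(kettle, "summary")  # fewer details
full = product_profile.select(kettle)                # more details
```

The same content item yields different presentations depending on the scenario, while the list of required elements keeps every product profile consistent.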

Content Models

Content models indicate the elements of content that are available to support different audiences and scenarios.  They specify the specific kinds of messages the publisher has planned to use across different content.  They specify the semantic structure of the content — or put more simply, how different content elements are related to each other in their meaning.

Content is built from various kinds of messages associated with different topics and having different roles, such as extended descriptions, instructions, calls-to-action, value propositions, admonitions, and illustrations.  The content model provides an overview of the different kinds of essential messages that are available to build different versions and variations of content.

In some respects, a content model is analogous to a site map.  A site map provides external audiences and systems a picture of the content published on a website.  A content model provides a map of the internal content resources that are available for publication.  But instead of representing a tree of web pages like a site map, the content model presents a constellation of “nodes” that indicate available information resources.  A node is a basic unit of content that is part of and connected to the larger structure of content.  Nodes correspond to content elements within published content: the units of content described within a pair of HTML tags.

Each node in a content model represents a distinct unit of content covering a discrete message or statement of information. Nodes are connected to other nodes elsewhere.  A node may be empty (authors can supply any message provided it relates to the expected meaning), or a node may be pre-populated with one or more values (indicating that the meaning will have a certain predefined message).  

Content models connect nodes by identifying the relationships between them: how one element relates to another.  The model can show how different nodes are associated, such as what role one node has relative to another.  For example, one node could be part of another node because it is a detail relating to a larger topic.  The relationships provide pathways between different nodes of content.
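The nodes and relationships described above can be sketched as a small graph. This is an illustrative assumption about how such a model might be represented, not a description of any specific system: the `Node` and `ContentModel` classes, the role labels, and the `part_of` relation are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A discrete unit of content with a role; value is None when empty."""
    node_id: str
    role: str                      # e.g. "description", "admonition"
    value: Optional[str] = None    # pre-populated message, if any


@dataclass
class ContentModel:
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Each relationship is (source node id, relation name, target node id).
    relationships: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def relate(self, source: str, relation: str, target: str) -> None:
        """Record how one node relates to another; both must already exist."""
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both nodes must be added before relating them")
        self.relationships.append((source, relation, target))

    def related(self, node_id: str, relation: str) -> List[Node]:
        """Follow one kind of relationship outward from a node."""
        return [
            self.nodes[target]
            for source, rel, target in self.relationships
            if source == node_id and rel == relation
        ]


model = ContentModel()
model.add_node(Node("product_overview", role="description"))  # empty node
model.add_node(Node("warranty_note", role="admonition",
                    value="Warranty is void if the seal is broken."))
model.relate("warranty_note", "part_of", "product_overview")

parents = model.related("warranty_note", "part_of")
```

Here the warranty note is a pre-populated detail that belongs to a larger topic, while the overview is left empty for authors to fill.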

Content models are more abstract than other approaches to structuring content, and can therefore be open to wider interpretation about what they do.  The content model represents perhaps the deepest level of content structure, capturing all reusable and variable content elements. 

No single model, template or design system

No single representation of content structure can effectively depict all its different aspects.  I haven’t seen any single view representation that supports the different kinds of design decisions required.  For example, wireframes mix together fixed structures defined by templates with dynamic structures associated with UI design.  When content is embedded within screen comps, it is hard to see which elements are fixed and which are fluid.  Single views promote a tunnel focus on a specific decision, but block visibility into larger considerations that may be involved.  I’ve seen various attempts to improve wireframes to make them more interactive and content-friendly, but the basic limitations remain.

Consider a simple content element: an alert that tells a customer that their subscription is expiring and that they need to submit new payment details.  UI design needs to consider how to deliver the alert so that it is noticed but not annoying.  Content design needs to decide whether to use an existing alert, or write a new one.  The template determines where within a page or screen the alert appears.  The content type will specify the rules triggering delivery of the alert: who gets it, and when. And the content model may hold variations of the alert, and their mappings to different content types that use them.  You need a better alert, but what do you need to change?  What should stay the same, so you don’t mess up other things you’ve worked hard to get right?
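A small sketch can show why it helps to keep these decisions separate. Everything here is assumed for illustration: the two alert variants, the 14-day notice window, and the `choose_alert` function are hypothetical, and only two of the five decisions (the delivery rule and the stored variations) are modeled.

```python
from datetime import date
from typing import Optional

# Content model (assumed): the available variations of the alert message.
ALERT_VARIANTS = {
    "expiring_soon": "Your subscription expires soon. Please update your payment details.",
    "expires_today": "Your subscription expires today. Update your payment details to keep access.",
}


def choose_alert(expiry: date, today: date, notice_days: int = 14) -> Optional[str]:
    """Content type rule (assumed): who gets the alert, and when.

    Returns None when no alert should be shown.
    """
    days_left = (expiry - today).days
    if days_left < 0 or days_left > notice_days:
        return None  # already expired, or too early to mention
    key = "expires_today" if days_left == 0 else "expiring_soon"
    return ALERT_VARIANTS[key]


alert = choose_alert(expiry=date(2024, 1, 20), today=date(2024, 1, 10))
```

Rewording the alert touches only `ALERT_VARIANTS`. Changing who sees it, or when, touches only the rule in `choose_alert`. Placement and interaction behavior would live elsewhere again.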

Such decisions require coordination; different people may be responsible for different aspects. Not only must decisions and tasks be coordinated across people, they must be coordinated across time.  Those involved need to be aware of past decisions, easily reuse these when appropriate, and be able to modify them when not.  Agility is important, but so is governance.

A benefit of content structure is that it can accelerate the creation and delivery of content.  The challenge of content structure is that it’s not one thing.  There are different approaches, and each has its own value to offer.   Web publishers have more tools than ever to solve specific problems. But they still need truly integrated platforms that help web teams coordinate different kinds of decisions relating to specifying and choosing content elements. 

— Michael Andrews