
Molecular Content and the Separation of Concerns

Our current ways of writing and producing web content seem ill-prepared for the needs of the future. Content producers focus on planning articles or web pages, but existing approaches aren’t sufficiently scalable or flexible. Web publishers need to produce growing quantities of increasingly specific content. High-volume content still requires too much human effort: the tedious crafting of generic text, and/or complicated planning that often provides inadequate flexibility. Content producers, using tools designed for creating articles, lack a viable strategy for creating content that machines can use in new contexts, such as voice interfaces. Before audiences see the content, machines need to act on it. It’s time to consider the machine as a key audience segment, instead of an incidental party.

Within the content strategy community, a discussion is starting about making content “molecular”. Content molecules are fundamental building blocks that can be combined and transformed in various ways. The concept still lacks a precise definition, but it seems a compelling metaphor for thinking about content. Molecules connect together, like Tinkertoys™. Forces (business, technical, and consumer preferences) are pulling content in different directions, and how to connect molecules of content together is a pressing issue. The metaphor of molecular content offers a chance to reimagine how content is created, and how it can serve future needs.

“A molecule of content is the smallest stable autonomous part of content, with a unique purpose. Molecules, of varying purpose, can be built into stable compounds of content in order to form meaning and provide a purpose.” — Andy McDonald and Toni Byrd-Ressaire

Content needs to serve many functions. It must provide coherent narratives as it always has done. But it is also needed in short bursts. Content will need to be interactive: responding to user requests, anticipating needs, updating in real time as circumstances change. Molecular content is fundamentally different from previous ideas of “modular” content. Modular content consists of static, standalone chunks of words. Molecular content is data-aware: responsive, interactive and updatable. Molecular content is connected to logic, and gets involved in the context in which it is used.

In the future we can expect content will need to be:

  • Able to speak and listen
  • Capable of animation to show a change in state
  • Able to communicate with machine sensors
  • Capable of changing its tone of voice according to the user’s state of attention

The notion of shape shifting words seems fantastic — especially to authors accustomed to controlling each word as it appears in a text.  But words are really just a special form of data — meaningful data.  Computers can manipulate data in all manner of ways.

Content for human communication can be more complex than common computer data.  Existing data practices will contribute to the foundations of molecular content.  But they will need to be extended and enhanced to support the unique needs of words and writing.

Writers don’t like considering words as data.  They raise two objections.  First, they consider words as more nuanced  than data.  Second, they worry that the process of writing will resemble programming.  Their concerns are valid.  No progress will happen if solutions are too rigid, or too complex.  At the same time, writers need to prepare for the possibility that the process of web writing will fundamentally change.

Four Hidden Activities in Writing

Writing is a tacit process, rarely subject to analytic scrutiny. We’re aware we form sentences involving subjects, verbs and objects. These get joined together into paragraphs, and into articles. But the process can be so iterative that we don’t notice separate steps. When writers talk about process, they generally refer to rituals rather than workflows.

If we break down the writing process, we see different activities:

  1. Making statements (typically sentences)
  2. Choosing the subjects to write about in statements
  3. Organizing these statements into a flow
  4. Making judgments through implicit references

Implicit references are the stealthiest. In our speech and writing, humans summarize thoughts. We may make implicit statements, or render an explicit judgment that saves us from having to list everything (for example, saying “the best…”, or “good…”). People who work with data talk about enumeration — basically, creating a definitive list of every value (e.g., the seven days of the week). When we talk, we assume shared knowledge rather than repeat it. We assume others don’t want a complete list of every city in a country; they just want to know the largest cities.

If anything, technology has blurred these activities together even more. When we can rewrite on the screen, we can cut and replace with abandon. The stringing together of words becomes unconscious. Any attempt to make the process more explicit, and more managed, can feel limiting, and is often met with resistance.

One of the big unanswered questions about molecular content is how to write it.  Molecular content will likely require a new way of creating content. First, we need to examine the process of web writing.

The Three Writing Workflows

Any web writing process needs to address several questions:

  • What things (proper nouns or entities) do you want to discuss?
  • What statements do you want to make about these things?
  • How to structure these statements (what template to use)?

These questions can be addressed in three different sequences or workflows:

  1. Author-driven
  2. Template-driven
  3. Domain-driven.

The author-driven approach treats writing as a craft, rather than as a process. The sequence starts with a blank page. The author writes a series of statements, then structures them. Finally, he may tag things mentioned in the text to identify entities.

The author-driven sequence is:

  1. Write statements
  2. Structure text
  3. Tag text

The template-driven sequence starts with a template. This approach is gaining popularity, with products like GatherContent providing templated forms that authors can fill in. The structure is pre-determined. It is not unlike filling in an online job application: the author adds text inside boxes on a form. Later, the text can be tagged to identify the entities mentioned.

The template-driven sequence is:

  1. Choose template or structure
  2. Write statements
  3. Tag text

The template-driven approach can sometimes allow the reuse of some blocks of text. But often, the goal of templates is to organize content and facilitate the inputting of text. A common organization can provide consistency for audiences viewing different content. But such structure doesn’t itself reduce the amount of writing required if the content has a single use.

The domain-driven sequence starts by choosing what entities (people, places or things) the author plans to discuss.  Entities are the key variables in content.  They are what people searching for content are most likely to be seeking. Shouldn’t we know what entities we will be talking about before we start writing about them?  Once the entities are chosen, authors write statements about them.  They consider what can be said about them.  Writers can organize these statements by associating them with containers that provide structure.  Unlike the other approaches, authors don’t need to worry about tagging, because the entities are already tagged.

The domain-driven sequence is:

  1. Choose entities (pre-tagged)
  2. Write statements
  3. Choose containers for structure.

Domain-driven writing builds content around entities. In other approaches, identifying which entities are mentioned is an afterthought, handled during tagging. Because it is entity-focused, domain-driven content is well matched to the needs of molecular content.

Molecular Content through Domain-driven Writing

Domain-driven content is not new, but it is still not widely known. I wrote about content and domains as they relate to Italian wine several years ago when I was living in Italy. Happily, a new book by Carrie Hane and Mike Atherton, Designing Connected Content, has just come out and discusses domain-driven content in detail. It’s an excellent place to start learning how to plan content from a domain perspective. For writers wanting to understand how domains can influence writing, I recommend the blog of Teodora Petkova, who has been writing about this topic extensively.

Domain-driven writing may seem hard to envision.  How will one choose entities to write about before writing?  Perhaps writers could tap to select available entities, much like they tap to select available airline seats.  The tool could be connected to an open source knowledge graph that describes entities; a growing number of knowledge graphs are available.

The selected entities could appear at the top of the screen, reminding the writer what they should be writing about. The tool could help writers remember to include details, or explore connections between topics they might not think to explore. It could even helpfully offer a list of synonyms for entities, so that writers know what vocabulary they can draw on to discuss a subject. Maybe it could even recommend some existing generic text about related entities, and the author could decide if that text is appropriate for discussing the entities they have chosen.

Domain-driven writing is a good fit for factually rich content.  It’s not an approach to use to write the next great novel.  Novelists will stick with the author-driven, blank page approach.

Much web writing is repetitive. The details change, but the body of the text is the same. Domain-driven content puts the focus on those details. It isolates the variables that change from those that are constant, and brings attention to the context in which variables appear.

Even when the text of statements changes, domain-driven content allows factual data to be reused in different content.  If a product is mentioned in various content, the price can always appear beside the product name, no matter where the product name appears.

Two approaches to domain-driven writing are available, which can be used in combination.  The first approach creates reusable statements that are applicable to many different entities. The second approach allows custom statements for specific entities.

The first approach aims to standardize recurring patterns of writing by:

  1. Decomposing existing writing into common segments or chunks
  2. Normalizing or standardizing the text of these segments
  3. Reusing text of segments.

Instead of writing to make the text original-sounding, writers focus on how to make the text simpler by reducing the variation of expression.  The emphasis is on comprehension.  It is not about boosting attention by aiming for originality or the unexpected.

Let’s imagine you have a website about caring for your dog. You have content about different topics, such as grooming your dog, or training your dog. Dogs come in different breeds. Does the breed of dog change anything you say about training advice? Do you need to customize a paragraph about training requirements for specific breeds? The website might have a mix of generic content that applies to all breeds, with some custom content relating to a specific breed. The delta between the generic and the custom content helps reveal when a specific breed is a special case in some way.

The second approach is to start with entities — the key details your content addresses. Instead of thinking about grand themes and then the details, you can reverse the sequence. In our dog website example, we start with a collection of entities related to dogs. This is the dog domain. What might you want to say about the dog domain? A tool could help you explore different angles. You might choose a breed of dog to start with, perhaps a poodle. The tool could show you all kinds of concepts connected to a poodle. These concepts might be the start of statements you’d want to make. The tool would resemble a super-helpful thesaurus. It would highlight different connections. You could see other breeds of dog that are either similar or dissimilar. Seeing those entities might prompt you to write some statements comparing different breeds of dog. You might see concepts connected with dogs, such as traveling with dogs. You could even drill down into sub-concepts, such as air travel or car travel. The experience of traveling with a dog by air differs according to breed: a Dachshund versus a Saint Bernard, for example. If you need to write statements to support specific tasks, the domain can help you identify related people, organizations, things, events, and locations — all the entities that are involved in the domain.

Domain-driven content is scalable.  You can start with statements about specific things, and then consider how you can generalize these statements.

Slide from a presentation by Rob Gillespie

Molecular Content: How Content Gets Liberated

Molecular content needs to be highly flexible.  To deliver such flexibility, different components need to play well with the rest of the world. One can’t overstate the diversity that exists currently in the web world.  Millions of individuals and organizations are trying to do various things, developing new solutions.   Diversity is increasing.  New platforms, new syntaxes, new channels, new programming languages, new architectures must all be accommodated.

We need to let go of the hope that one tool can do everything.  Tightly-coupled systems are seductive because they seem to offer everything you need in one place. Tightly-coupled systems give rise to authoring-development hybrid tools such as Dreamweaver or FontoXML.  Content, structure and logic all live together in a single source, which seems convenient.  But your flexibility will be limited by what those tools allow you to do.

The ideal of single sourcing of content is becoming less and less viable.  Requirements are becoming too elaborate and varied to expect a monolithic collection of files following a unified architecture to address all needs.  A single model for publishing web content can’t cope with everything being thrown at it.  Models are brittle.  We need systems where different functions are handled in different ways, depending on shifting circumstances and diverse preferences.   When you use a single model, others will reject what’s good about an approach because they hate what’s limiting about it.

Web publishing is becoming more decoupled.  Headless CMSs separate the authoring environment from content management and delivery.  Content management systems offer APIs that allow unbundled delivery of content.  Even the authoring process is getting unbundled, with new tools that specialize in distributed input, collaborative editing and offline workflows.

While the trend toward decoupling is gaining momentum, most attempts are limited in scope. They don’t fundamentally change how content is created, or how it can be made available. They rely on the current writing paradigm, which is still document-focused. No one yet has developed solutions that make content truly molecular.

Molecular content will require a radical decoupling of systems that process content.  The only way to create content that is genuinely future-ready is to remove dependencies that require others to adopt legacy approaches and conventions.  Systems need to be adaptable, where parties involved in producing web content can swap out different sub-processes as new needs and better approaches emerge.

The Backend of Molecular Content: Separation of Concerns

It is challenging to talk about a concept as novel as molecular content without addressing how it would work.  I want to introduce a concept followed by developers called “the separation of concerns” and discuss how it is relevant to content.

Suppose a developer wanted to code a heading that said “Everyone is talking about John.”  In old-school HTML, developers would hard-code content, structure, and in-line logic together in a single HTML file.  Here’s what single source content might look like:

<html>
<body>
<h1>Everyone is talking about <span id="person"></span>.</h1>
<script>
document.getElementById("person").innerHTML = "John";
</script>
</body>
</html>

The file is hard to read, because everything is smushed together.  In a single file, we have content, structure, a variable, and a script.  It may sound efficient to have all that description in one place, until you realize you can’t reuse any of these elements.  It is brittle.

In modern practice, webpages are built from different elements: content files, templates, and separate scripts providing common logic.  Even metadata can be injected into a webpage from an outside file.  This decoupling allows many-to-many relationships.  One webpage may call many scripts, and one script may serve many webpages.  This is an example of the separation of concerns.

To separate concerns means that code is organized according to its purpose.  It is easier to maintain and reuse code when common things with similar roles are grouped together.

Let’s consider the different concerns, or dimensions, of how content is assembled. The dimensions that computer systems consider when assembling content are in many respects similar to those authors consider when they assemble content. They are:

  • Content variables
  • Narrative statements
  • Containers for content
  • Logic relating to content

These different concerns can be managed separately.

Variables

Variables are the energy in the content.  Because they vary, they are interesting.  Humans are hardwired to notice stuff that changes.

Variables live within statements. Suppose we wish aloud to our companion, Google, on our smartphone. We say: “Ok, Google. Get me a flight between Paris and Hong Kong for less than $500 in the first week of March.” We have numerous variables in that one statement. We have destinations (Paris and Hong Kong), price ($500) and time (first week of March). Which of those is negotiable? When we think about variables as being subject to negotiation, we can see how statements might change.

Variables animate statements the way atoms animate molecules, to use a metaphor.

Variables are frequently proper nouns or entities, which are visible in the content.   Such variables are descriptive metadata about the content.  A price mentioned in a statement is an example.

Some variables are not visible.  They are background information that won’t show up in a statement, but will be used to choose statements.  Such variables are often administrative metadata about content.  For example, to know if a statement is new or old, we could access a “published on” date variable.

What makes variables powerful is that they can be associated with each other.  These are not random words like the ones  used by a random phrase generator.  Variables follow patterns, and form associations.

For example, if we wanted to describe a person, we start by thinking about the variables associated with a person.

Person:
  Name: John,
  Gender: Male,
  Profession: Painter

A different person will have the same variables, but with different values. We can keep adding variables that might be useful. This is the factual raw material that can be used in our content.

An important point about variables is that they can be represented in different ways. Because we want to separate concerns as much as possible, the variables live separately from statements, instead of being embedded in them. Because variables are separate, they can be transformed to serve different needs. We don’t worry about what syntax is used. It could be JSON, YAML or Turtle. When variables exist separately, their syntax can be easily converted. We also don’t worry about what schema is used. We can use different schemas, and note the equivalences between how different schemas refer to a variable. We can rename a variable if required. Maybe we want to refer to a person’s job instead of a person’s profession. Not a problem.
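As a minimal sketch (assuming nothing about the eventual tooling or syntax), the person variables might live as a simple JSON-style object and be remapped for a different schema without touching any statements:

// The variables live apart from any statement, in whatever syntax suits (JSON-style here).
const person = {
  name: "John",
  gender: "Male",
  profession: "Painter"
};

// Renaming for another schema is simple: expose "profession" under the name "job".
const { profession, ...rest } = person;
const personAsJob = { ...rest, job: profession };
// personAsJob is { name: "John", gender: "Male", job: "Painter" }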

Statements

Statements are generally text, though they could be an audio or video clip, or even an SVG graphic.  I’ll stick to text, since it is most familiar and easiest to discuss.

Statements will often be complete sentences, perhaps several related sentences.  But they could be shorter phrases such as a slogan or the line of a song.  Statements can be added to other statements.  Each line of a song can be joined together to produce a statement conveying the song’s full lyrics.

Statements become powerful when used in multiple places. Statements can accommodate visible variables to produce statement variations.  Some statements won’t use variables, and will be the same wherever they appear.

Statements are the basic molecules of content. Some statements will be short, and some will be long. The length depends on how consistent the information is. We can use variables to produce statement variations, but the statements themselves stay consistent. When we need to talk about certain variables only in some situations, new statements are needed.

Let’s look at how statements can incorporate variables.  We will use the person variables from our previous example.

Statement_1 : “Everyone is talking about {Person.Name}, the popular {Person.Profession}.”

Statement_2: “{Person.Name}’s Big Moment”

These are two alternatively worded statements that could be made about the same person. Maybe we want to use them in different contexts. Or we want to test which is more popular. Or maybe they will both be used in the same article. Because these statements are independent, they can be used in many ways.

I’ve used “pseudocode” to show how variables work within statements.  If we have many persons, we can be selective about which ones get mentioned.

But the syntax used to represent the text can follow any convention.  It could be plain text, or a subset of Markdown.  We are only interested in representing the information, not how it is structured or presented.  The information is independent of structure.  There’s no in-line markup.
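To make the substitution concrete, here is one possible sketch of filling the visible variables in the statements above; the fillStatement function and the placeholder convention are illustrative assumptions, not a prescribed syntax:

// Substitutes {Person.X} placeholders with values from a person object.
function fillStatement(template, person) {
  return template.replace(/\{Person\.(\w+)\}/g, (match, key) => person[key]);
}

const person = { Name: "John", Profession: "Painter" };

fillStatement("Everyone is talking about {Person.Name}, the popular {Person.Profession}.", person);
// "Everyone is talking about John, the popular Painter."

fillStatement("{Person.Name}’s Big Moment", person);
// "John’s Big Moment"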

Structure

Structure is how statements are arranged and represented.  Structure is one of the two ways that content molecules get “bonded” to other molecules (the other way is logic, to be discussed next).

Statements and structures have a many-to-many relationship.  That means the same statement can be used in many different structures, and a single structure can accommodate more than one statement.

A simple example (again using pseudocode) will show how statements get bonded into structures.  It is as simple as dropping the statement into a structural element.

/// Structure_1
<h1>{Statement_1}</h1>

/// But it could be instead

/// Structure_2
<h2>{Statement_1}</h2>

A single statement could be applied to many structures, including image captions or email headers.

As we consider a wider range of content, we can see how statements need to be used in different templates. For example, the same transcript may need to appear as the text of an interview, and as the subtitles of a video.

Structure should not be hard-coded into statements the way XML markup and CSS-selectors tend to do.  That limits the reuse of statements.

Molecular content should be independent of any specific structure, and able to adapt to various structures. We need structure flexibility.  Statements need to change structural roles.  We are accustomed to thinking about a statement having a fixed structural role.
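A small sketch (invented for illustration, not the author’s code) of how a statement could change structural roles at assembly time, rather than having a structure hard-coded into it:

// Bonds a statement to whatever structural element the context requires.
function bind(statement, element) {
  return `<${element}>${statement}</${element}>`;
}

const statement1 = "Everyone is talking about John, the popular Painter.";

bind(statement1, "h1");         // page heading
bind(statement1, "h2");         // section heading
bind(statement1, "figcaption"); // image caption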

Logic

Logic provides instructions about what content to get.  It may be a script (a few steps to do), a query (a command to find values of a certain kind), or a function (a reusable set of instructions).

Logic processes content to characterize it.  For example, if the content is about the “top” movies this week, the logic does a query to determine and display what the top-grossing films are.  Logic allows computers to make implicit statements, just like writers do, which makes the text sound more natural.

“With content molecules, content is separated not only from the presentation, but from the business logic, that is from the way the content is processed and manipulated.”  Alex Mayscheff

Logic is another way content molecules can bond together.  When logic is applied to statements, logic plays a matchmaking role.

Logic can also be applied to variables.  It can help to decide the right values to include in a statement.

A common example is when a query of a database generates a list. The query asks for the top 10 best-selling literary fiction titles, and a statement is returned with the 10 titles in a list.
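A rough sketch of that kind of query logic in JavaScript; the records, field names, and sample titles are invented for illustration:

// Hypothetical sales records; in practice this would be a database query.
const books = [
  { title: "Title A", genre: "literary fiction", unitsSold: 9400 },
  { title: "Title B", genre: "literary fiction", unitsSold: 8100 },
  { title: "Title C", genre: "thriller", unitsSold: 12000 }
];

const topTitles = books
  .filter(book => book.genre === "literary fiction")
  .sort((a, b) => b.unitsSold - a.unitsSold)
  .slice(0, 10)
  .map(book => book.title);

// The implicit judgment ("best selling") is rendered as a statement.
const statement = "This week's best-selling literary fiction: " + topTitles.join(", ") + ".";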

Logic can provide more than simply reporting data.  As software gets smarter, it will be able to make more natural sounding statements.

Consider a simple example.  If we know the gender of a person, we can create new variables indicating the appropriate person pronoun and possessive pronoun to use.  Expressed in pseudocode, it might work like this:

Function(genderPronoun)
  If Person.Gender == Male
    Assign
      PersonalPronoun -> He
      PossessivePronoun -> His
  Else
    Assign
      PersonalPronoun -> She
      PossessivePronoun -> Her
  Endif

Logic can summarize variables so they are easier for humans to comprehend. If we rely only on variables, we have to see the values exactly as they are recorded. In earlier examples, the variable was directly injected into a statement. The variable says: when you get here, put a certain value here.

Using logic, a variable can call a function.  The function instructs: When you get here, figure out the appropriate value to put here.  This gives much more flexibility for the scope of values that can be used in statements.

Because the logic is separated from the variables and the statements, we don’t care what form of logic is used. It might be PHP, Python or JavaScript. Or a query language such as SQL or SPARQL. Or some new AI algorithm. Developers might combine different programming languages, so that different ones can perform specialized roles. It is a very different situation from the one that exists when content is encoded in XML, forcing developers to rely on XSLT or some other XML-focused language.
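As one concrete possibility, the pronoun pseudocode above might be written in JavaScript roughly as follows; the function name and data shape are assumptions:

// Derives pronoun variables from a person's gender variable.
function genderPronoun(person) {
  if (person.gender === "Male") {
    return { personalPronoun: "he", possessivePronoun: "his" };
  }
  return { personalPronoun: "she", possessivePronoun: "her" };
}

const pronouns = genderPronoun({ name: "John", gender: "Male", profession: "Painter" });
// pronouns.personalPronoun === "he", pronouns.possessivePronoun === "his"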

Systematizing What’s Routine

My excursion into the coding of molecular content may give a false impression that writers will need to code in the future.  I hope that is not the case.  Nearly everyone I know agrees that code is distracting when appearing in writing.  Ideally, the separation of concerns means that code won’t appear in statements.

What’s been missing are systems that make it easy for writers to reuse facts (variables), statements (content chunks), and templates (structures).  Systems should let writers add some logic to their writing without worrying about the programming behind it, perhaps by choosing some pre-made “recipes” that can be dragged into text and inserted.  I’ve seen enough different efforts to simplify systems (from Jekyll to IFTTT to automated suggestions) to believe writer-friendly tools to support molecular content are possible.  But new systems emerge only when a large community believes there is a better way.  No one person, or company, can build and sell a new system, much less force its adoption.

When I started out working in the web world, all user interface screens were individually designed.  Each one needed to be crafted and tested individually.  Each screen was a precious creation.  Eventually, the UX community realized that approach was madness.  UX folks weren’t able to keep up with the volume of screens that users needed to see.  And UX staff were recreating the same kinds of screens again and again.  Eventually, the UX community adopted components, patterns and templates.  They created systems that could scale.  Original, new designs are needed only in highly novel situations, such as new device platforms or enabling interaction technology.  The rest can be reused and repeated.

Atomic Design methodology by Brad Frost

UX designers now talk about a concept called atomic design.  Atomic design sounds related to molecular content.

The transformation of UX design is still ongoing, but it’s impressive in what has been achieved already.  One might expect designers would be resistant to technology.  Many studied graphic design in art schools, using colored markers.  When applying their graphics knowledge to web design, they saw the benefits of reusable CSS, adopted plug-and-play Javascript frameworks, and started building component libraries. Much of the progress was the work of various individuals trying to solve common problems.  Only recently have companies started marketing complete solutions for UI component management.  Designers still like to sketch, but they don’t expect screen design to be a manual craft.

I’ve long been puzzled why so many art school grads can happily embrace technology, while so many writers have an anti-technology attitude.  Designers have found how technology can extend their productivity immensely. I hope writers will discover the same.  A craft approach to writing is wonderful for novels, but insane for producing corporate web content.

Most of the original structured writing approaches built in XML are tightly-coupled, resulting in systems that are both inflexible and overly complex.  A more loosely-coupled system, based on a separation of concerns, promises to be more flexible, and can be less complex, since adopters can choose the capabilities they need and are willing to learn.  Designers have benefitted from open systems, such as CSS patterns, Javascript frameworks, and other publicly available, reusable components. Designers can choose what technology they want to use, often having more than one option. Writers need open systems to support their work as well.

—  Michael Andrews


Structural Metadata: Key to Structured Content

Structural metadata is the most misunderstood form of metadata.  It is widely ignored, even among those who work with metadata. When it is discussed, it gets confused with other things.  Even people who understand structural metadata correctly don’t always appreciate its full potential. That’s unfortunate, because structural metadata can make content more powerful. This post takes a deep dive into what structural metadata is, what it does, and how it is changing.

Why should you care about structural metadata? The immediate, self-interested answer is that structural metadata facilitates content reuse, taking content that’s already created to deliver new content. Content reuse is nice for publishers, but it isn’t a big deal for audiences.  Audiences don’t care how hard it is for the publisher to create their content. Audiences want content that matches their needs precisely, and that’s easy to use.  Structural metadata can help with that too.

Structural metadata matches content with the needs of audiences. Content delivery can evolve beyond creating many variations of content — the current preoccupation of many publishers. Publishers can use structural metadata to deliver more interactive content experiences.  Structural metadata will be pivotal in the development of multimodal content, allowing new forms of interaction, such as voice interaction.  Well-described chunks of content are like well-described buttons, sliders and other forms of interactive web elements.  The only difference is that they are more interesting.  They have something to say.

Some of the following material will assume background knowledge about metadata.  If you need more context, consult my very approachable book, Metadata Basics for Web Content.

What is Structural Metadata?

Structural metadata is data about the structure of content. In some ways it is not mysterious at all. Every time you write a paragraph, and enclose it within a <p> paragraph element, you’ve created some structural metadata. But structural metadata entails far more than basic HTML tagging. It gives data to machines on how to deliver the content to audiences. When structural metadata is considered as a fancy name for HTML tagging, much of its potency gets missed.

The concept of structural metadata originated in the library and records management field around 20 years ago. To understand where structural metadata is heading, it pays to look at how it has been defined already.

In 1996, a metadata initiative known as the Warwick Framework first identified structural metadata as “data defining the logical components of complex or compound objects and how to access those components.”

In 2001, a group of archivists, who need to keep track of the relationships between different items of content, came up with a succinct definition:  “Structural metadata can be thought of as the glue that binds compound objects together.”

By 2004, the National Information Standards Organization (NISO) was talking about structural metadata in their standards.  According to their definition in the z39.18 standard, “Structural metadata explain the relationship between parts of multipart objects and enhance internal navigation. Such metadata include a table of contents or list of figures and tables.”

Louis Rosenfeld and Peter Morville introduced the concept of structural metadata to the web community in their popular book, Information Architecture for the World Wide Web — the “Polar Bear” book. Rosenfeld and Morville use the structural metadata concept as a prompt to define the information architecture of a website:

“Describe the information hierarchy of this object. Is there a title? Are there discrete sections or chunks of content? Might users want to independently access these chunks?”

A big theme of all these definitions is the value of breaking content into parts. The bigger the content, the more it needs breaking down. The structural metadata for a book relates to its components: the table of contents, the chapters, parts, index and so on. It helps us understand what kinds of material are within the book and to access specific sections of the book, even if it doesn’t tell us all the specific things the book discusses. This is important information, which, surprisingly, wasn’t captured when Google undertook their massive book digitization initiative a number of years ago. When the books were scanned, each book became one big file, like a PDF. To find a specific figure or table within a book on Google Books requires searching or scrolling to navigate through the book.

The contents of scanned books in Google Books lack structural metadata, limiting the value of the content.

Navigation is an important purpose of structural metadata: to access specific content, such as a specific book chapter.  But structural metadata has an even more important purpose than making big content more manageable.  It can unbundle the content, so that the content doesn’t need to stay together. People don’t want to start with the whole book and then navigate through it to get to a small part in which they are interested. They want only that part.

In his recent book Metadata, Richard Gartner touches on a more current role for structural metadata: “it defines structures that bring together simpler components into something larger that has meaning to a user.” He adds that such information “builds links between small pieces of data to assemble them into a more complex object.”

In web content, structural metadata plays an important role in assembling content. When content is unbundled, it can be rebundled in various ways. Structural metadata identifies the components within content types. It indicates the role of the content, such as whether the content is an introduction or a summary.

Structural metadata plays a different role today than it did in the past, when the assumption was that there was one fixed piece of large content that would be broken into smaller parts, identified by structural metadata.  Today, we may compose many larger content items, leveraging structural metadata, from smaller parts.

The idea of assembling content from smaller parts has been promoted in particular by DITA evangelists such as Ann Rockley (DITA is a widely used framework for technical documentation). Rockley uses the phrase “semantic structures” to refer to structural metadata, which she says “enable(s) us to understand ‘what’ types of content are contained within the documents and other content types we create.” Rockley’s discussion helpfully makes reference to content types, which some other definitions don’t explicitly mention. She also introduces another concept with a similar-sounding name, “semantically rich” content, to refer to a different kind of metadata: descriptive metadata. In XML (which is used to represent DITA), the term semantic is used generically for any element. Yet the difference between structural and descriptive metadata is significant — though it is often obscured, especially in the XML syntax.

Curiously, semantic web developments haven’t focused much on structural metadata for content (though I see a few indications that this is starting to change).  Never assume that when someone talks about making content semantic, they are talking about adding structural metadata.

Don’t Confuse Structural and Descriptive Metadata

When information professionals refer to metadata, most often they are talking about descriptive metadata concerning people, places, things, and events. Descriptive metadata indicates the key information included within the content. It typically describes the subject matter of the content, and is sometimes detailed and extensive. It helps one discover what the content is about, prior to viewing the content. Traditionally, descriptive metadata was about creating an external index — a proxy — such as assigning keywords or subject headings about the content. Over the past 20 years, descriptive metadata has evolved to describing the body of the content in detail, noting entities and their properties.

Richard Gartner refers to descriptive metadata as “finding metadata”: it locates content that contains some specific information.  In modern web technology, it means finding values for a specific field (or property).  These values are part of the content, rather than separate from it.  For example, find smartphones with dual SIMs that are under $400.  The  attributes of SIM capacity and price are descriptive metadata related to the content describing the smartphones.

Structural metadata indicates how people and machines can use the content.  If people see a link indicating a slideshow, they have an expectation of how such content will behave, and will decide if that’s the sort of content they are interested in.  If a machine sees that the content is a table, it uses that knowledge to format the content appropriately on a smartphone, so that all the columns are visible.  Machines rely extensively on structural metadata when stitching together different content components into a larger content item.

Structural and descriptive metadata can be indicated in the same HTML tag.  This tag indicates the start of an introductory section discussing Albert Einstein.
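A hypothetical tag along these lines conveys the idea; the doc-introduction role comes from the Digital Publishing WAI-ARIA vocabulary discussed later, and the RDFa attributes identify Albert Einstein as the subject (the exact markup shown in the diagram may differ):

<!-- Structural metadata: the section plays the role of an introduction. -->
<!-- Descriptive metadata: the section is about the entity Albert Einstein. -->
<section role="doc-introduction"
         vocab="https://schema.org/" typeof="Person"
         resource="https://www.wikidata.org/entity/Q937">
  <p>Albert Einstein was a theoretical physicist …</p>
</section>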

Structural metadata sometimes is confused with descriptive metadata because many people use vague terms such as “structure” and “semantics” when discussing content. Some people erroneously believe that structuring content makes the content “semantic”.  Part of this confusion derives from having an XML-orientation toward content.  XML tags content with angle-bracketed elements. But XML elements can be either structures such as sections, or they can be descriptions such as names.  Unlike HTML, where elements signify content structure while descriptions are indicated in attributes, the XML syntax creates a monster hierarchical tree, where content with all kinds of roles are nested within elements.  The motley, unpredictable use of elements in XML is a major reason it is unpopular with developers, who have trouble seeing what roles different parts of the content have.

The buzzword “semantically structured content” is particularly unhelpful, as it conflates two different ideas together: semantics, or what content means, with structure, or how content fits together.  The semantics of the content is indicated by descriptive metadata, while the structure of the content is indicated by structural metadata.  Descriptive metadata can focus on a small detail in the content, such as a name or concept (e.g., here’s a mention of the Federal Reserve Board chair in this article).  Structural metadata, in contrast, generally focuses on a bigger chunk of content: here’s a table, here’s a sidebar.   To assemble content, machines need to distinguish what the specific content means, from what the structure of the content means.

Interest in content modeling has grown recently, spurred by the desire to reuse content in different contexts. Unfortunately, most content models I’ve seen don’t address metadata at all; they just assume that the content can be pieced together. The models almost never distinguish between the properties of different entities (descriptive metadata), and the properties of different content types (structural metadata). This can lead to confusion. For example, a place has an address, and that address can be used in many kinds of content. You may have specific content types dedicated to discussing places (perhaps tourist destinations) and want to include address information. Alternatively, you may need to include the address information in content types that are focused on other purposes, such as a membership list. Unless you make a clear distinction in the content model between what’s descriptive metadata about entities, and what’s structural metadata about content types, many people will be inclined to think there is a one-to-one correspondence between entities and content types, for example, that all addresses belong to the content type discussing tourist destinations.
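One way to keep that distinction visible is to model entities and content types as separate things, so it is obvious that the same entity property can feed many content types. A rough sketch, with invented names:

// Descriptive metadata: properties of an entity.
const place = {
  name: "Harbor View Museum",
  address: "12 Main Street, Springfield"
};

// Structural metadata: the components each content type assembles.
const touristDestinationPage = {
  components: ["headline", "introduction", "place.address", "openingHours"]
};
const membershipList = {
  components: ["memberName", "place.address", "contactEmail"]
};
// The same address property serves two different content types.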

Structural metadata isn’t merely a technical issue to hand off to a developer.  Everyone on a content team who is involved with defining what content gets delivered to audiences, needs to jointly define what structural metadata to include in the content.

Three More Reasons Structural Metadata Gets Ignored…

Content strategists have inherited frameworks for working with metadata from librarians, database experts and developers. None of those roles involves creating content, and their perspective of content is an external one, rather than an internal one. These hand-me-down concepts don’t fit the needs of online content creators and publishers very well.  It’s important not to be misled by legacy ideas about structural metadata that were developed by people who aren’t content creators and publishers.  Structural metadata gets sidelined when people fail to focus on the value that content parts can contribute in different scenarios.

Reason 1: Focus on Whole Object Metadata

Librarians have given little attention to structural metadata, because they’ve been most concerned with cataloging and  locating things that have well defined boundaries, such as books and articles (and most recently, webpages).  Discussion of structural metadata in library science literature is sparse compared with discussions of descriptive and administrative metadata.

Until recently, structural metadata has focused on identifying parts within a whole.  Metadata specialists assumed that a complete content item existed (a book or document), and that structural metadata would be used to locate parts within the content.  Specifying structural metadata was part of cataloging existing materials. But given the availability of free text searching and more recently natural language processing, many developers question the necessity of adding metadata to sub-divide a document. Coding structural metadata seemed like a luxury, and got ignored.

In today’s web, content exists as fragments that can be assembled in various ways.  A document or other content type is a virtual construct, awaiting components. The structural metadata forms part of the plan for how the content can fit together. It’s important to define the pieces first.

Reason 2: Confusion with Metadata Schemas

I’ve recently seen several cases where content strategists and others mix up the concept of structural metadata, with the concept of metadata structure, better known as metadata schemas.  At first I thought this confusion was simply the result of similar sounding terms.  But I’ve come to realize that some database experts refer to structural metadata in a different way than it is being used by librarians, information architects, and content engineers.  Some content strategists seem to have picked up this alternative meaning, and repeat it.

Compared to semi-structured web content, databases are highly regular in structure. They are composed of tables of rows and columns. The first column of a row typically identifies what the values relate to. Some database admins refer to those keys or properties as the structure of the data, or the structural metadata. For example, the OECD, the international economic organization, says: “Structural metadata refers to metadata that act as identifiers and descriptors of the data. Structural metadata are needed to identify, use, and process data matrixes and data cubes.” What is actually being referred to is the schema of the data table.

Database architects develop many custom schemas to organize their data in tables.  Those schemas are very different from the standards-based structural metadata used in content.  Database tables provide little guidance on how content should be structured.  Content teams shouldn’t rely on a database expert to guide them on how to structure their content.

Reason 3: Treated as Ordinary Code

Web content management systems are essentially big databases built in programming languages like PHP or .Net. There’s a proclivity among developers to treat chunks of content as custom variables. As one developer noted when discussing WordPress: “In WordPress (WP), the meaning of Metadata is a bit fuzzier. It stores post metadata such as custom fields and additional metadata added via plugins.”

As I’ve noted elsewhere, many IT systems that manage content ignore web metadata standards, resulting in silos of content that can’t work together. It’s not acceptable to define chunks of content as custom variables. The purpose of structural metadata is to allow different chunks of content to connect with each other.  CMSs need to rely on web standards for their structural metadata.

Current Practices for Structural Metadata

For machines to piece together content components into a coherent whole, they need to know the standards for the structural metadata.

Until recently, structural metadata has been indicated only during the prepublication phase, an internal operation where standards were less important.  Structural metadata was marked up in XML together with other kinds of metadata, and transformed into HTML or PDF.  Yet a study in the journal Semantic Web last year noted: “Unfortunately, the number of distinct vocabularies adopted by publishers to describe these requirements is quite large, expressed in bespoke document type definitions (DTDs). There is thus a need to integrate these different languages into a single, unifying framework that may be used for all content.”

XML continues to be used in many situations. But a recent trend has been to adopt more lightweight approaches, using HTML, to publish content directly. Bypassing XML is often simpler, though the plainness of HTML creates some issues as well.

As Jeff Eaton has noted, getting specific about the structure of content using HTML elements is not always easy:

“We have workhorse elements like ul, div, and span; precision tools like cite, table, and figure; and new HTML5 container elements like section, aside, and nav. But unless our content is really as simple as an unattributed block quote or a floated image, we still need layers of nested elements and CSS classes to capture what we really mean.”

Because HTML elements are not very specific, publishers often don’t know how to represent structural metadata within HTML.  We can learn from the experience of publishers who have used XML to indicate structure, and who are adapting their structures to HTML.

Scientific research and technical documentation are two genres where content structure is well established, and structural metadata is mature. Both of these genres have explored how to indicate the structure of their content in HTML.

Scientific research papers are a distinct content type that follows a regular pattern. The National Library of Medicine’s Journal Article Tag Suite (JATS) formalizes the research paper structure into a content type as an XML schema.  It provides a mixture of structural and descriptive metadata tags that are used to publish biomedical and other scientific research.  The structure might look like:

<sec sec-type="intro">

<sec sec-type="materials|methods">

<sec sec-type="results">

<sec sec-type="discussion">

<sec sec-type="conclusions">

<sec sec-type="supplementary-material" ... >

Scholarly HTML is an initiative to translate the typical sections of a research paper into common HTML.  It uses HTML elements, and supplements them with typeof attributes to indicate more specifically the role of each section.  Here’s an example of some attribute values in their namespace, noted by the prefix “sa”:

<section typeof="sa:MaterialsAndMethods">

<section typeof="sa:Results">

<section typeof="sa:Conclusion">

<section typeof="sa:Acknowledgements">

<section typeof="sa:ReferenceList">

As we can see, these sections overlap with the JATS, since both are describing similar content structures.  The Scholarly HTML initiative is still under development, and it could eventually become a part of the schema.org effort.

DITA — the technical documentation architecture mentioned earlier — is a structural metadata framework that embeds some descriptive metadata.  DITA structures topics, which can be different information types: Task, Concept, Reference, Glossary Entry, or Troubleshooting, for example.  Each type is broken into structural elements, such as title, short description, prolog, body, and related links.  DITA is defined in XML, and uses many idiosyncratic tags.
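For readers who haven’t seen DITA markup, a minimal Task topic might look roughly like this (a simplified sketch; the id and wording are invented):

<task id="replace-battery">
  <title>Replace the battery</title>
  <shortdesc>How to replace the battery in the device.</shortdesc>
  <taskbody>
    <steps>
      <step><cmd>Power off the device.</cmd></step>
      <step><cmd>Open the cover and swap the battery.</cmd></step>
    </steps>
  </taskbody>
</task>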

HDITA is a draft syntax to express DITA in HTML. It converts DITA-specific elements into HTML attributes, using the custom data-* attribute. For example, a “key definition” element <keydef> becomes an attribute within an HTML element, e.g. <div data-hd-class="keydef">. Types are expressed with the attribute data-hd-type.

The use of the data-* attribute offers some advantages, such as JavaScript access by clients. It is not, however, intended for use as a cross-publisher metadata standard. The W3C notes: “A custom data attribute is an attribute in no namespace…intended to store custom data private to the page or application.” It adds:

“These attributes are not intended for use by software that is not known to the administrators of the site that uses the attributes. For generic extensions that are to be used by multiple independent tools, either this specification should be extended to provide the feature explicitly, or a technology like microdata should be used (with a standardized vocabulary).”

The HDITA drafting committee appears to use “hd” in the data attribute to signify that the attribute is specific to HDITA.  But they have not declared a namespace for these attributes (the XML namespace for DITA is xmlns:ditaarch.)  This will prevent automatic machine discovery of the metadata by Google or other parties.

The Future of Structural Metadata

Most recently, several initiatives have explored possibilities for extending structural metadata in HTML.  These revolve around three distinct approaches:

  1. Formalizing structural metadata as properties
  2. Using WAI-ARIA to indicate structure
  3. Combining class attributes with other metadata schemas

New Vocabularies for Structures

The web standards community is starting to show more interest in structural metadata.  Earlier this year, the W3C released the Web Annotation Vocabulary.  It provides properties to indicate comments about content.  Comments are an important structure in web content that are used in many genres and scenarios. Imagine that readers may be highlighting passages of text. For such annotations to be captured, there must be a way to indicate what part of the text is being referenced.  The annotation vocabulary can reference specific HTML elements and even CSS selectors within a body of text.

Outside of the W3C, a European academic group has developed the Document Components Ontology (DoCO), “a general-purpose structured vocabulary of document elements.” It is a detailed set of properties for describing common structural features of text content. The DoCO vocabulary can be used by anyone, though its initial adoption will likely be limited to research-oriented publishers. However, many specialized vocabularies such as this one have become extensions to schema.org. If DoCO were in some form absorbed by schema.org, its usage would increase dramatically.

Diagram of the Document Components Ontology (DoCO)

WAI-ARIA

WAI-ARIA is commonly thought of as a means to make functionality accessible.  However, it should be considered more broadly as a means to enhance the functionality of web content overall, since it helps web agents understand the intentions of the content. WAI-ARIA can indicate many dynamic content structures, such as alerts, feeds, marquees, and regions.

The new Digital Publishing WAI-ARIA developed out of the EPUB standards, which have a richer set of structural metadata than is available in standard HTML5. The goal of the Digital Publishing WAI-ARIA is to “produce structural semantic extensions to accommodate the digital publishing industry”. It defines the following structural roles:

  • doc-abstract
  • doc-acknowledgments
  • doc-afterword
  • doc-appendix
  • doc-backlink
  • doc-biblioentry
  • doc-bibliography
  • doc-biblioref
  • doc-chapter
  • doc-colophon
  • doc-conclusion
  • doc-cover
  • doc-credit
  • doc-credits
  • doc-dedication
  • doc-endnote
  • doc-endnotes
  • doc-epigraph
  • doc-epilogue
  • doc-errata
  • doc-example
  • doc-footnote
  • doc-foreword
  • doc-glossary
  • doc-glossref
  • doc-index
  • doc-introduction
  • doc-noteref
  • doc-notice
  • doc-pagebreak
  • doc-pagelist
  • doc-part
  • doc-preface
  • doc-prologue
  • doc-pullquote
  • doc-qna
  • doc-subtitle
  • doc-tip
  • doc-toc

 

To indicate the structure of a text box showing an example:

<aside role="doc-example">

<h1>An Example of Structural Metadata in WAI-ARIA</h1>

…

</aside>

Content expressing a warning might look like this:

<div role="doc-notice" aria-label="Explosion Risk">

<p><em>Danger!</em> Mixing reactive materials may cause an explosion.</p>

</div>

Although book-focused, DOC-ARIA roles provide a rich set of structural elements that can be used with many kinds of content.  In combination with the core WAI-ARIA, these attributes can describe the structure of web content in extensive detail.

CSS as Structure

For a long while, developers have been creating pseudo structures using CSS, such as making infoboxes to enclose certain information. Class is a global attribute of HTML, but has become closely associated with CSS, so much so that some believe that is its only purpose.  Yet Wikipedia notes: “The class attribute provides a way of classifying similar elements. This can be used for semantic purposes, or for presentation purposes.”  Some developers use what are called “semantic classes” to indicate what content is about.  The W3C advises when using the class attribute: “authors are encouraged to use values that describe the nature of the content, rather than values that describe the desired presentation of the content.”

Some developers claim that the class attribute should never be used to indicate the meaning of content within an element, because HTML elements will always make that clear. I agree that web content should never use the class attribute as a substitute for using a meaningful HTML element. But the class attribute can sometimes further refine the meaning of an HTML element. Its chief limitation is that class names involve private meanings. Yet if they are self-describing they can be useful.

Class attributes are useful for selecting content, but they operate outside of metadata standards.  However, schema.org is proposing a property that will allow class values to be specified within schema.org metadata.  This has potentially significant implications for extending the scope of structural metadata.

The motivating use case is as follows: “There is a need for authors and publishers to be able to easily call out portions of a Web page that are particularly appropriate for reading out aloud. Such read-aloud functionality may vary from speaking a short title and summary, to speaking a few key sections of a page; in some cases, it may amount to speaking most non-visual content on the page.”

The pending cssSelector property in schema.org can identify named portions of a web page.  The class could be a structure such as a summary or a headline that would be more specific than an HTML element.  The cssSelector has a companion property called xpath, which identifies HTML elements positionally, such as the paragraphs after h2 headings.
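Expressed as JSON-LD, the proposal looks roughly like this (a sketch based on the draft speakable and SpeakableSpecification definitions; the selectors and URL are placeholders):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "name": "Example article",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".headline", ".summary"]
  },
  "url": "https://www.example.com/article"
}
</script>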

These features are not yet fully defined. In addition to indicating speakable content, the cssSelector can indicate parts of a web page. According to a Github discussion: “The ‘cssSelector’ (and ‘xpath’) property would be particularly useful on http://schema.org/WebPageElement to indicate the part(s) of a page matching the selector / xpath.  Note that this isn’t ‘element’ in some formal XML sense, and that the selector might match multiple XML/HTML elements if it is a CSS class selector.”  This could be useful selecting content targeted at specific devices.

The class attribute can identify structures within the web content, working together with entity-focused properties that describe specific data relating to the content.  Both of these indicate content variables, but they deliver different benefits.

Entity-based (descriptive) metadata can be used for content variables about specific information. They will often serve as  text or numeric variables. Use descriptive metadata variables when choosing what informational details to put in a message.

Structural metadata can be used for phrase-based variables, indicating reusable components. Phrases can be either blocks (paragraphs or divs), or snippets (a span). Use structural metadata variables when choosing the wording to convey a message in a given scenario.

A final interesting point about cssSelector in schema.org: like other schema.org properties, it can be expressed either as inline markup in HTML (microdata) or as an external JSON-LD script. This gives developers the flexibility to choose whether to use coding libraries that are optimized for arrays (JSON-flavored), or ones focused on selectors. For too long, what metadata gets included has been influenced by developer preferences in coding libraries. The fact that CSS selectors can be expressed as JSON suggests that hurdle is being transcended.

Conclusion

Structural metadata is finally getting some love in the standards community, even though awareness of it remains low among developers.  I hope that content teams will consider how they can use structural metadata to be more precise in indicating what their content does, so that it can be used flexibly in emerging scenarios such as voice interactions.

— Michael Andrews