
How Content Can Answer Unanticipated Questions

How can publishers answer questions that audiences may have, when they don’t always know what will interest people? This is not a trick question. To be agile, publishers need to plan for flexibility. They need to prepare content for scenarios they can’t anticipate.

Content design has never been more important. People have less time than ever to deal with unwanted content. But content design should not be about spoon-feeding audiences answers to pre-approved questions. It should instead empower audiences to consume the precise content they need, and let them decide which answer matches their need. Publishers shouldn’t assume they can always anticipate what audiences need, or that they can always package content to match a known need. Recent developments in search technology are shaking up thinking about how to provide answers to audiences.

The Limitations of Questions as Templates for Content Development

Current practice presumes a certain process: start with a list of questions that users have, then write content answering those questions. The question will tell us what content to create. This approach, however, has limitations that may not be obvious.

I’ve long been an advocate and practitioner of user research.  It makes no sense to create content users indicate they have absolutely no interest in.  But user research is merely a starting point for considering user questions.  It should not be the final arbiter of what could be important to users.

“People are really fascinating and interesting … and weird! It’s really hard to guess their behaviors accurately.” — Peter Koechley, Upworthy

Many user questions can’t be guessed — or discovered — in advance. When doing user research, organizations can be over-confident about what questions they think users will have in the future. User research probes the motivational level of interests and needs, rather than the more granular informational level of specific questions. User research helps us understand users, but it simplifies user needs into personas. The diversity and contextual complexity that spawn the range of real-world user questions get smoothed over. Qualitative user research data is too broad to uncover the full range of potential questions in detail. Quantitative analysis of past online queries can provide more granular insights, but even quantitative data won’t predict all situations, especially when novel situations arise.

Two common approaches to question-templated content development are:

  • The “top tasks” approach
  • The “long tail” approach

Some content strategists favor the top tasks approach — especially those who focus on task-oriented transactional content.

Many SEOs favor the long tail approach — especially those who want to promote awareness-oriented marketing content.

The top tasks approach makes assumptions about essential user questions, based on past user behavior with a website. An organization may decide that the top 10 search queries drive 90% of web traffic, so those 10 questions are the ones to answer. Each question gets one answer. It’s a rearview approach that assumes no curiosity on the part of audiences. Audience needs exist only as an extension of their interaction with the organization. All questions considered relevant relate to user tasks linked to that specific organization.

The hidden assumptions of the top tasks approach are:

  • Everyone has the same questions
  • Because everyone has the same questions, everyone should get the same answers
  • If different people start to ask different questions, publishers can ignore those questions, because they aren’t top questions.

Providing homogenized answers to homogenized questions is appealing to homogenized organizations, especially government offices, banks, and tech support units. But cookie-cutter content can seem like it’s created by a faceless organization. Standardized answers don’t satisfy customers’ growing expectations for more personalized service.

The long tail approach tries to anticipate user questions by crafting answers for many question variations, each addressing an ever narrower range of questions. The idea is to build an inventory of the questions all kinds of people are asking, and then develop answers to all of them, so there is something for everyone. On the surface, this approach seems to deliver more individualized answers. But as we will see, that is not always the case.

Both the top tasks and long tail approaches assume that each question has one answer. A content item exists to answer that one specific question.

In practice, the formula of one question, one answer doesn’t hold. Different questions lead to the same content. Type question variations into Google, and Google rewards you with the same links going to the same content. Not all question variations are substantially different. If you type “How to fly a kite” into Google, you can see related questions such as “How to fly a kite step-by-step” or “How to fly a kite by yourself”. You’ll also find “long tail” questions such as “How to fly a kite with little wind” or, even more optimistically, “How to fly a kite with no wind”.

The notion of a related search is vague. It could be a search query that is essentially equivalent to another, but phrased differently. It could be a question that implies distinctions or details that may not be present in the information, or that may not even be crucial. Suppose we imagine content addressing “How to fly a kite for firefighters” and another on “Easy steps to kite flying for bus drivers”. We’d likely find the essence of this long tail content is no different from the more general answer. The idea that long tail content is necessarily more relevant is a fiction.

The other characteristic of question-templated content is that the questions and answers are pre-assembled and frozen.  If we phrase a question differently, such as “What’s different about kite flying for bus drivers?”, we aren’t likely to get an answer.  At most, we’ll get content talking about kite flying that for some reason mentions bus drivers.  The content creator decides what content the reader will get, instead of the reader deciding.

Content design should be built on a foundation of compositional content. What content is assembled and delivered can be based on the specific question asked. Suppose you want to ask “How to tell someone to ‘go fly a kite’”? When decomposed, the question reveals two distinct sub-questions. One sub-question concerns how to deliver a message in general, covering tone or medium. The other sub-question concerns what message alternatives are available about a specific issue — in this example, the desire to get someone else to change their behavior.

In principle, machines can assemble an answer to such a complex question, even though no person has created an answer to that specific question already. The machine would draw on two components: one addressing the points to make about an issue, and the other addressing ways to deliver those points.

A compositional topic could be rich in variations that would yield different answers.  It could address: “How to tell a colleague…” or “How to tell a nosy relative…,” or whomever.  The answer could include components about the general aspects of the issue, which could be supplemented with some advice specific to the question variation.

For those familiar with structured content, the use of components to create content variations will seem familiar. The difference here is that users initiate the assembly of components in novel configurations. We don’t know in advance what the user wants, so we have to provide the raw material to supply the answer to their unknown query.
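To make this concrete, here is a minimal sketch of what two such components might look like as marked-up fragments. The data-component, data-topic, and data-audience attribute names are hypothetical placeholders, not part of any standard.

  <!-- Component 1: general advice on delivering a message (tone, medium) -->
  <section data-component="delivery-advice" data-topic="asking-for-behavior-change">
    <p>Decide whether to raise the issue in person, in writing, or with a
    light-hearted hint, and match your tone to the relationship…</p>
  </section>

  <!-- Component 2: message alternatives for the specific issue, per audience -->
  <section data-component="message-options" data-topic="asking-for-behavior-change"
           data-audience="colleague">
    <p>With a colleague, a joking "go fly a kite" may land better than a formal
    request…</p>
  </section>

A question such as “How to tell a colleague to ‘go fly a kite’” could then be answered by assembling the general delivery component with the audience-specific option, even though no one pre-wrote that exact answer.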

Information Generates Questions

Part of the reason people can be unpredictable in their questions is that their interests and understanding evolve over time.  Sometimes the facts of a situation can change as well.

Laura E. Davis, digital news director of USC’s Annenberg Media Center, recently wrote about “Writing answers before you know the question.” Her framing flips the assumption that most writers have: that writers know reader questions ahead of time, and the task of the writer is to provide answers to them. Most writers expect the information they present to follow the questions audiences ask. But the reverse is also true: information, or the expectation of information, sparks questions. Sometimes writers will never have thought of the questions their readers might have.

Davis cites several trends that are making audience questions less predictable.  Audiences are becoming more conversational in how they access content.  Questions can unfold in a conversation, without knowing where they may lead.  Events can unfold quickly, and not conform to a tidy summary answer. These issues gain importance as conversational interfaces become more common.  “As we move forward, more and more, we’ll be writing answers before we know the question.”

In conversation, questions and answers flow spontaneously.  How can content become more spontaneous?  How can content prepare for a “zero UI” future, as Davis puts it?  We’ll look at two approaches, metadata and machine reading, which publishers can combine to offer laser precision in answers.

‘Literate Machines’ Will Provide Dynamic Answers

Historically, questions asked online were answered by a list of hyperlinks. Even today, many chatbots provide an answer by pointing to a hyperlink to content the reader must read. When a computer points a user to a document title (in the form of a hyperlink), it generally is pointing the user to pre-assembled content. Pre-assembled content runs a high risk of not being exactly what the user is looking for.

Yet the more recent trend is to provide answers directly, instead of answering queries by providing links to documents. Everyone is familiar with Google’s instant answers. This approach is being adopted by most of the other major tech companies as well. How answers are delivered is transforming quickly.

Advances in semantic technology and AI are allowing both questions and answers to become more iterative and fluid. Users may not consider a single answer to a question they pose as complete. They may want several pieces of information to give them a complete understanding. To give users complete answers, machines stitch together fragments from different sources. Audiences can ask clarifying or follow-up questions to fill out their knowledge, and contextual answers will appear.

Semantic metadata facilitates machine discovery and understanding of information. Metadata is powerful because it can relate information from different sources. Publishers can include their information as part of a relevant answer to a user query. For example, suppose a user asks “What local cinemas are showing films made before 1960 this evening?” There may not be a single item of content providing that answer. But metadata from different content can be used to assemble an answer. The listings of local cinemas can be combined with data about films from a film encyclopedia (to filter by year). The ability of metadata to assemble information from many sources upends the expectation of some publishers, who believe they must provide comprehensive information about topics to answer any audience question. Instead, their goal should be to provide comprehensive information that they are uniquely positioned to offer, and to link through metadata to other sources of related information that might arise in users’ questions.
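As a rough sketch of how such an answer could be assembled, the cinema’s listing might carry schema.org markup like the following. In practice the film’s release date would come from a separate source (such as a film encyclopedia) joined through identifiers; it is shown inline here for brevity, and the names and dates are illustrative.

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "ScreeningEvent",
    "location": { "@type": "MovieTheater", "name": "Example Cinema" },
    "startDate": "2017-11-04T19:30",
    "workPresented": {
      "@type": "Movie",
      "name": "North by Northwest",
      "datePublished": "1959"
    }
  }
  </script>

A query agent could filter screening events by the datePublished of the workPresented, answering the “films made before 1960” question even though no single document was written for it.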

The question in this example may seem arbitrary — and it is. Why would someone want to watch films made before 1960? What’s special about 1960? Why not 1965? Or 1950? Because the question, seen from the outside, seems arbitrary, no one will create content specifically to answer it. The variations in how the question could be framed are limitless. Which is why metadata is powerful in providing answers to questions that are asked infrequently, or have never been asked before. Just because a question is novel does not mean it is unimportant.

Given the quantity of content that’s created, someone may have written content that provides part of an answer to a question. But that answer could be buried within a larger discussion that isn’t the focus of the user’s question. If you are curious where a new film star grew up, there might not be specific content answering that question. But he or she may have mentioned it in passing during an interview about their latest film. How might you locate that information without reading various interviews in full?

Machine reading comprehension (MRC) is an emerging technique that promises to transform how content is used. Its premise is simple but awe-inspiring: machines can read texts just as humans do, and understand what the text means. They can do this at incredible speeds, so that they can locate specific statements quickly, interpret what a statement means, and relate it to questions or statements made elsewhere. Machine reading does not require structure, but it presumably benefits from having structure.

Amy Webb at NYU demonstrated how machine reading comprehension works in a recent presentation (here, at minute 34). Reading a book, MRC can extract its meaning. Yes, someday soon computers will be able to speed-read War and Peace and tell us what the novel is about (beyond the obvious, that it’s about Russia).

Slide from Amy Webb presentation on machine reading comprehension (MRC) at ONA17 conference.

MRC has been a keen research focus of many firms developing audio interfaces. Audioburst is a new service that digests the transcripts of audio interviews. Users can ask Alexa a question about a news topic; Alexa queries Audioburst to find snippets of content relevant to the query, then combines and plays back audio clips from different radio programs related to the question.

Microsoft has been at the forefront of MRC research.   I want to highlight some of their work because they are combining MRC with semantic metadata in products that are widely used.

“We’re trying to develop what we call a literate machine: A machine that can read text, understand text and then learn how to communicate, whether it’s written or orally.” — Kaheer Suleman of Microsoft

Microsoft notes: “Machine reading comprehension systems also could help people more easily find the information they need in car manuals or dense tax code documents.”

MRC is being used in Microsoft products such as Cortana (the voice assistant similar to Alexa or Siri), and Bing (the search engine that competes with Google).

A recent news article states: “Microsoft’s virtual assistant Cortana will get an upgrade as well, allowing it to make use of machine reading comprehension to summarize search results.”

Earlier this month, Bing announced it would use MRC: “Bing’s comparison answers understand entities, their aspects, and using machine reading comprehension, reads the web to save you time combing through numerous dense documents.”

How Bing uses machine reading to provide multifaceted answers based on text from different sources

 

For Bing users this means:

  • “If there are different authoritative perspectives on a topic, such as benefits vs drawbacks, Bing will aggregate the two viewpoints from reputable sources”
  • “If there are multiple ways to answer a question, you’ll get a carousel of intelligent answers.”
  • “If you need help figuring out the right question to ask, Bing will help you with clarifying questions.”

As the Microsoft examples highlight, the notion that there is only one best answer to a question is no longer a given.  People want different perspectives, and different levels of detail.  Literate machines can help people retrieve answers that match their interests.

Conclusion

Information-rationing is not in the best interests of content consumers.  Content strategists have long warned of the dangers of providing too much information.  But too much information isn’t necessarily the problem.  No one complains about Wikipedia having too much information.

My advice to content creators is this.  If you have unique information to share, you should publish it.  Even if you’re not sure whether users have a pre-existing need to look for that information, it could be valuable.  Self-censorship does not make sense.  At the same time, content creators should not feel they must create a complete or definitive presentation of a topic.  Increasingly, machines will be able to stitch together information from different sources for the benefit of users.  Content creators should focus on what they know best.  Duplicating information that exists elsewhere benefits no one.

We can’t predict what information people will need in the future. Content that is information-rich is worthwhile content. We need to make such information accessible, so audiences can retrieve it when it is needed. We need to help make machines literate.

— Michael Andrews


Seamless: Structural Metadata for Multimodal Content

Chatbots and voice interaction are hot topics right now. New services such as Facebook Messenger and Amazon Alexa have become popular quickly. Publishers are exploring how to make their content multimodal, so that users can access content in varied ways on different devices. User interactions may be either screen-based or audio-based, and will sometimes be hands-free.

Multimodal content could change how content is planned and delivered. Numerous discussions have looked at one aspect of conversational interaction: planning and writing sentence-level scripts. Content structure is another dimension relevant to voice interaction, chatbots and other forms of multimodal content. Structural metadata can support the reuse of existing web content to support multimodal interaction. Structural metadata can help publishers escape the tyranny of having to write special content for each distinct platform.

Seamless Integration: The Challenge for Multimodal Content

In-Vehicle Infotainment (IVI) systems such as Apple’s CarPlay illustrate some of the challenges of multimodal content experiences. Apple’s Human Interface Guidelines state: “On-screen information is minimal, relevant, and requires little decision making. Voice interaction using Siri enables drivers to control many apps without taking their hands off the steering wheel or eyes off the road.” People will interact with content hands-free, and without looking. CarPlay includes six distinct inputs and outputs:

  1. Audio
  2. Car Data
  3. iPhone
  4. Knobs and Controls
  5. Touchscreen
  6. Voice (Siri)

The CarPlay UIKit even includes “Drag and Drop Customization”. When I review these details, much seems as if it could be distracting to drivers. Apple states that with CarPlay, “iPhone apps that appear on the car’s built-in display are optimized for the driving environment.” What that iPhone app optimization means in practice could determine whether the driver gets in an accident.

CarPlay: if it looks like an iPhone, does it act like an iPhone? (screenshot via Apple)

Multimodal content promises seamless integration between different modes of interaction, for example, reading and listening. But multimodal projects carry a risk as well if they try to port smartphone or web paradigms into contexts that don’t support them. Publishers want to reuse content they’ve already created. But they can’t expect their current content to suffice as it is.

In a previous post, I noted that structural metadata indicates how content fits together. Structural metadata is a foundation of a seamless content experience. That is especially true when working with multimodal scenarios. Structural metadata will need to support a growing range of content interactions, involving distinct modes. A mode is a form of engaging with content, both in terms of requesting and receiving information. A quick survey of these modes suggests many aspects of content will require structural metadata.

Platform Example           | Input Mode          | Output Mode
Chatbots                   | Typing              | Text
Devices with Mic & Display | Speaking            | Visual (Video, Text, Images, Tables) or Audio
Smart Speakers             | Speaking            | Audio
Camera/IoT                 | Showing or Pointing | Visual or Audio

Multimodal content will force content creators to think more about content structure. Multimodal content encompasses all forms of media, from audio to short text messages to animated graphics. All these forms present content in short bursts. When focused on other tasks, users aren’t able to read much, or listen very long. Steven Pinker, the eminent cognitive psychologist, notes that humans can only retain three or four items in short-term memory (contrary to the popular belief that people can hold seven items). When exploring options by voice interaction, for example, users can’t scan headings or links to locate what they want. Instead of the user navigating to the content, the content needs to navigate to the user.

Structural metadata provides information to machines to choose appropriate content components. Structural metadata will generally be invisible to users — especially when working with screen-free content. Behind the scenes, the metadata indicates hidden structures that are important to retrieving content in various scenarios.

Metadata is meant to be experienced, not seen. A photo of an Amazon customer’s Echo Show, revealing code (via Amazon)

Optimizing Content With Structural Metadata

When interacting with multimodal content, users have limited attention, and a limited capacity to make choices. This places a premium on optimizing content so that the right content is delivered, and so that users don’t need to restate or reframe their requests.

Existing web content is generally not optimized for multimodal interaction — unless the user is happy listening to a long article being read aloud, or seeing a headline cropped in mid-sentence. Most published web content today has limited structure. Even if the content was structured during planning and creation, once delivered, the content lacks structural metadata that allows it to adapt to different circumstances. That makes it less useful for multimodal scenarios.

In the GUI paradigm of the web, users are expected to continually make choices by clicking or tapping. They see endless opportunities to “vote” with their fingers, and this data is enthusiastically collected and analyzed for insights. Publishers create lots of content, waiting to see what gets noticed. Publishers don’t expect users to view all their content, but they expect users to glance at their content, and scroll through it until users have spotted something enticing enough to view.

Multimodal content shifts the emphasis away from planning delivery of complete articles, and toward delivering content components on-demand, which are described by structural metadata. Although screens remain one facet of multimodal content, some content will be screen-free. And even content presented on screens may not involve a GUI: it might be plain text, such as with a chatbot. Multimodal content is post-GUI content. There are no buttons, no links, no scrolling. In many cases, it is “zero tap” content — the hands will be otherwise occupied driving, cooking, or minding children. Few users want to smudge a screen with cookie dough on their hands. Designers will need to unlearn their reflexive habit of adding buttons to every screen.

Users will express what they want, by speaking, gesturing, and if convenient, tapping. To support zero-tap scenarios successfully, content will need to get smarter, suggesting the right content, in the right amount. Publishers can no longer present an endless salad bar of options, and expect users to choose what they want. The content needs to anticipate user needs, and reduce demands on the user to make choices.

Users will always want to choose what topics they are interested in. They may be less keen on actively choosing the kind of content to use. Visiting a website today, you find articles, audio interviews, videos, and other content types to choose from. Unlike the scroll-and-scan paradigm of the GUI web, multimodal content interaction involves an iterative dialog. If the dialog lasts too long, it gets tedious. Users expect the publisher to choose the most useful content about a topic that supports their context.

Pattern: after saying what you want information about, now tell us how you’d like it (screenshot via Google News)

In the current use pattern, the user finds content about a topic of interest (topic criteria), then filters that content according to format preferences. In future, publishers will be more proactive deciding what format to deliver, based on user circumstances.

Structural metadata can help optimize content, so that users don’t have to choose how they get information. Suppose the publisher wants to show something to the user. They have a range of images available. Would a photo be best, or a line drawing? Without structural metadata, both are just images portraying something. But if structural metadata indicates the type of image (photo or line diagram), then deeper insights can be derived. Images can be A/B tested to see which type is most effective.
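One way this distinction might be expressed is sketched below. Photograph is a real schema.org type, but there is no dedicated type for line drawings, so the additionalType URL for the drawing points to an invented placeholder vocabulary; the names and URLs are illustrative.

  <script type="application/ld+json">
  [
    {
      "@context": "https://schema.org",
      "@type": "ImageObject",
      "additionalType": "https://schema.org/Photograph",
      "name": "Coffee grinder, product photo",
      "contentUrl": "https://example.com/grinder-photo.jpg"
    },
    {
      "@context": "https://schema.org",
      "@type": "ImageObject",
      "additionalType": "https://example.com/vocab/LineDrawing",
      "name": "Coffee grinder, exploded line drawing",
      "contentUrl": "https://example.com/grinder-drawing.svg"
    }
  ]
  </script>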

A/B testing of content according to its structural properties can yield insights into user preferences. For example, a major issue will be learning how much to chunk content. Is it better to offer larger size chunks, or smaller ones? This issue involves the tradeoffs for the user between the costs of interaction, memory, and attention. By wrapping content within structural metadata, publishers can monitor how content performs when it is structured in alternative ways.

Component Sequencing and Structural Metadata

Multimodal content is not delivered all at once, as is the case with an article. Multimodal content relies on small chunks of information, which act as components. How to sequence these components is important.

Alexa showing some cards on an Echo Show device (via Amazon)

Screen-based cards are a tangible manifestation of content components. A card could show the current weather, or a basketball score. Cards, ideally, are “low touch.” A user wants to see everything they need on a single card, so they don’t need to interact with buttons or icons on the card to retrieve the content they want. Cards are post-GUI, because they don’t rely heavily on forms, search, links and other GUI affordances. Many multimodal devices have small screens that can display a card-full of content. They aren’t like a smartphone, cradled in your hand, with a screen that is scrolled. An embedded screen’s purpose is primarily to display information rather than for interaction. All information is visible on the card [screen], so that users don’t need to swipe or tap. Because most of us are accustomed to using screen-based cards already, but may be less familiar with screen-free content, cards provide a good starting point for considering content interaction.

Cards let us consider components both as units (providing an amount of content) and as plans (representing a purpose for the content). User experiences are structured from smaller units of content, but these units need to have a cohesive purpose. Content structure is more than breaking content into smaller pieces. It is about indicating how those pieces can fit together. In the case of multimodal content, components need to fit together as an interaction unfolds.

Each card represents a specific type of content (recipe, fact box, news headline, etc.), which is indicated with structural metadata. The cards also present information in a sequence of some sort.1 Publishers need to know how various types of components can be mixed and matched. Some component structures are intended to complement each other, while other structures work independently.

Content components can be sequenced in three ways. They can be:

  1. Modular
  2. Fixed
  3. Adaptive

Truly modular components can be sequenced in any order; they have no intrinsic sequence. They provide information in response to a specific task. Each task is assumed to be unrelated. A card providing an answer to the question of “What is the height of Mount Everest?” will be unrelated to a card answering the question “What is the price of Facebook stock?”

The technical documentation community uses an approach known as topic-based writing that attempts to answer specific questions modularly, so that every item of content can be viewed independently, without need to consult other content. In principle, this is a desirable goal: questions get answered quickly, and users retrieve the exact information they need without wading through material they don’t need. But in practice, modularity is hard to achieve. Only trivial questions can be answered on a card. If publishers break a topic into several cards, they should indicate the relations between the information on each card. Users get lost when information is fragmented into many small chunks, and they are forced to find their way through those chunks.

Modular content structures work well for discrete topics, but are cumbersome for richer topics. Because each module is independent of others, users, after viewing the content, need to specify what they want next. The downside of modular multimodal content is that users must continually specify what they want in order to get it.

Components can be sequenced in a fixed order. An ordered list is a familiar example of structural metadata indicating a fixed order. Narratives are made from sequential components, each representing an event that happens over time. The narrative could be a news story, or a set of instructions. When considered as a flow, a narrative involves two kinds of choices: whether to get details about an event in the narrative, or whether to move to the next event in the narrative. Compared with modular content, fixed sequence content requires less interaction from the user, but longer attention.

Adaptive sequencing manages components that are related, but can be approached in different orders. For example, content about an upcoming marathon might include registration instructions, sponsorship info, a map, and event timing details, each as a separate component/card. After viewing each card, users need options that make sense, based on content they’ve already consumed, and any contextual data that’s available. They don’t want too many options, and they don’t want to be asked too many questions. Machines need to figure out what the user is likely to need next, without being intrusive. Does the user need all the components now, or only some now?

Adaptive sequencing is used in learning applications; learners are presented with a progression of content matching their needs. It can utilize recommendation engines, suggesting related components based on choices favored by others in a similar situation. An important application of adaptive sequencing is deciding when to ask a detailed question. Is the question going to be valuable for providing needed information, or is the question gratuitous? A goal of adaptive sequencing is to reduce the number of questions that must be asked.

Structural metadata generally does not explicitly address temporal sequencing, because (until now) publishers have assumed all content would be delivered at once on a single web page. For fixed sequences, attributes are needed to indicate order and dependencies, to allow software agents to follow the correct procedure when displaying content. Fixed sequences can be expressed by properties indicating step order, rank order, or event timing. Adaptive sequencing is more programmatic. Publishers need to indicate the relation of components to a parent content type. Until standards catch up, publishers may need to indicate some of these details in data-* attributes.
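As a sketch of how both kinds of detail might be expressed today: the position property comes from schema.org’s ListItem, while data-depends-on is a hypothetical custom attribute standing in for dependency information that current standards don’t cover.

  <ol itemscope itemtype="https://schema.org/ItemList">
    <li id="step-1" itemprop="itemListElement" itemscope
        itemtype="https://schema.org/ListItem">
      <meta itemprop="position" content="1">
      <span itemprop="name">Register for the marathon</span>
    </li>
    <li id="step-2" itemprop="itemListElement" itemscope
        itemtype="https://schema.org/ListItem" data-depends-on="step-1">
      <meta itemprop="position" content="2">
      <span itemprop="name">Collect your race packet at the expo</span>
    </li>
  </ol>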

The sequencing of cards illustrates how new patterns of content interaction may necessitate new forms of structural metadata.

Composition and the Structure of Images

One challenge in multimodal interaction is how users and systems talk about images, as either an input (via a camera), or as an output. We are accustomed to reacting to images by tapping or clicking. We now have the chance to show things to systems, waving an object in front of a camera. Amazon has even introduced a hands-free voice activated IoT camera that has no screen. And when systems show us things, we may need to talk about the image using words.

Machine learning is rapidly improving, allowing systems to recognize objects. That will help machines understand what an item is. But machines still need to understand the structural relationship of items that are in view. They need to understand ordinary concepts such as near, far, next to, close to, background, group of, and other relational terms. Structural metadata could make images more conversational.

Vector graphics are composed of components that can represent distinct ideas, much like articles that are composed of structural components. That means vector images can be unbundled and assembled differently. The WAI-ARIA standard for web accessibility has an SVG Graphics Module that covers how to mark up vector images. It includes properties to add structural metadata to images, such as group (a role indicating similar items in the image) and background (a label for elements in the image in the background). Such structural metadata could be useful for users interacting with images using voice commands. For example, the user might want to say, “Show me the image without a background” or “with a different background”.
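A rough sketch of what such markup could look like: role="group" and aria-label are standard ARIA attributes usable in SVG, while labeling one group as the “background” is simply one possible convention, not a requirement of the module.

  <svg viewBox="0 0 400 300" role="img" aria-label="Kite flying at the beach">
    <g role="group" aria-label="background">
      <rect width="400" height="150" fill="#cde"/>         <!-- sky -->
      <rect y="150" width="400" height="150" fill="#fed"/> <!-- sand -->
    </g>
    <g role="group" aria-label="kite and string">
      <polygon points="300,40 330,80 300,120 270,80" fill="#c33"/> <!-- kite -->
      <line x1="300" y1="120" x2="220" y2="240" stroke="#333"/>    <!-- string -->
    </g>
  </svg>

A voice request to show the image “without a background” could then be honored by omitting the group labeled as background.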

Photos do not have interchangeable components the way that vector graphics do. But photos can present a structural perspective of a subject, revealing part of a larger whole. Photos can benefit from structural metadata that indicates the type of photo. For example, if a user wants a photo of a specific person, they might have a preference for a full-length photo or for a headshot. As digital photography has become ubiquitous, many photos are available of the same subject that present different dimensions of the subject. All these dimensions form a collection, where the compositions of individual photos reveal different parts of the subject. The IPTC photo metadata schema includes a controlled vocabulary for “scenes” that covers common photo compositions: profile, rear view, group, panoramic view, aerial view, and so on. As photography embraces more kinds of perspectives, such as aerial drone shots and omnidirectional 360 degree photographs, the value of perspective and scene metadata will increase.

For voice interaction with photo images to become seamless, machines will need to connect conversational statements with image representations. Machines may hear a command such as “show me the damage to the back bumper,” and must know to show a photo of the rear view of a car that’s been in an accident. Sometimes users will get a visual answer to a question that’s not inherently visual. A user might ask: “Who will be playing in Saturday’s soccer game?”, and the display will show headshots of all the players at once. To provide that answer, the platform will need structural metadata indicating how to present an answer in images, and how to retrieve player’s images appropriately.

Structural metadata for images lags behind structural metadata for text. Working with images has been labor intensive, but structural metadata can help with the automated processing of image content. Like text, images are composed of different elements that have structural relationships. Structural metadata can help users interact with images more fluidly.

Reusing Text Content in Voice Interaction

Voice interaction can be delivered in various ways: through natural language generation, through dedicated scripting, and through the reuse of existing text content. Natural language generation and scripting are especially effective in short answer scenarios — for example, “What is today’s 30-year mortgage rate?” Reusing text content is potentially more flexible, because it lets publishers address a wide scope of topics in depth.

While reusing written text in voice interactions can be efficient, it can potentially be clumsy as well. The written text was created to be delivered and consumed all at once. It needs some curation to select which bits work most effectively in a voice interaction.

The WAI-ARIA standards for web accessibility offer lessons on the difficulties and possibilities of reusing written content to support audio interaction. By becoming familiar with what ARIA standards offer, we can better understand how structural metadata can support voice interactions.

ARIA standards seek to reduce the burdens of written content for people who can’t scan or click through it easily. Much web content contains unnecessary interaction: lists of links, buttons, forms and other widgets demanding attention. ARIA encourages publishers to prioritize these interactive features with the tabindex attribute. It offers a way to help users fill out forms they must submit to get to content they want. But given a choice, users don’t want to fill out forms by voice. Voice interaction is meant to dispense with these interactive elements. Voice interaction promises conversational dialog.

Talking to a GUI is awkward. Listening to written web content can also be taxing. The ARIA standards enhance the structure of written content, so that content is more usable when read aloud. ARIA guidelines can help inform how to indicate structural metadata to support voice interaction.

ARIA encourages publishers to curate their content: to highlight the most important parts that can be read aloud, and to hide parts that aren’t needed. ARIA designates content with landmarks. Publishers can indicate what content has role=“main”, or they can designate parts of content by region. The ARIA standard states: “A region landmark is a perceivable section containing content that is relevant to a specific, author-specified purpose and sufficiently important that users will likely want to be able to navigate to the section easily and to have it listed in a summary of the page.” ARIA also provides a pattern for disclosure, so that not all text is presented at once. All of these features allow publishers to indicate more precisely the priority of different components within the overall content.
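A minimal example of these landmarks in ordinary HTML; the section labels are illustrative.

  <main role="main">
    <article>
      <h1>How to fly a kite</h1>
      <section role="region" aria-label="Quick answer">
        <p>Stand with your back to the wind, let out a little line, and…</p>
      </section>
      <section role="region" aria-label="Detailed steps">
        <!-- longer material a voice assistant might skip or summarize -->
      </section>
    </article>
  </main>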

ARIA supports screen-free content, but it is designed primarily for keyboard/text-to-speech interaction. Its markup is not designed to support conversational interaction — schema.org’s pending speakable specification, mentioned in my previous post, may be a better fit. But some ARIA concepts suggest the kinds of structures that written text needs to work effectively as speech. When content conveys a series of ideas, users need to know what are the major and minor aspects of the text they will be hearing. They need the spoken text to match the time that’s available to listen. Just as some word processors can provide an “auto summary” of a document by picking out the most important sentences, voice-enabled text will need to identify what to include in a short version of the content. The content might be structured in an inverted pyramid, so that only the heading and first paragraph are read in the short version. Users may even want the option of hearing a short version or a long version of a story or explanation.
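For comparison, schema.org’s speakable property points to the parts of a page best suited to being read aloud, for example by CSS selector. A rough sketch, with a headline and selectors that would need to match the page’s actual markup:

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "City council approves new bike lanes",
    "speakable": {
      "@type": "SpeakableSpecification",
      "cssSelector": [".headline", ".summary"]
    }
  }
  </script>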

Structural Metadata and User Intent in Voice Interaction

Structural metadata will help conversational interactions deliver appropriate answers. On the input side, when users are speaking, the role of structural metadata is indirect. People will state questions or commands in natural language, which will be processed to identify synonyms, referents, and identifiable entities, in order to determine the topic of the statement. Machines will also look at the construction of the statement to determine the intent, or the kind of content sought about the topic. Once the intent is known — what kind of information the user is seeking — it can be matched with the most useful kind of content. It is on the output side, when users view or hear an answer, that structural metadata plays an active role selecting what content to deliver.

Already, search engines such as Google rely on structural metadata to deliver specific answers to speech queries. A user can ask Google the meaning of a word or phrase (What does ‘APR’ mean?) and Google locates a term that’s been tagged with structural metadata indicating a definition, such as with the HTML element <dfn>.
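In HTML, that tagging can be as simple as the following (the wording of the definition is illustrative):

  <p><dfn id="apr">APR</dfn> (annual percentage rate) is the yearly cost of
  borrowing, including interest and certain fees, expressed as a percentage.</p>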

When a machine understands the intent of a question, it can present content that matches the intent. If a user asks a question starting with the phrase Show me… the machine can select a clip or photograph about the object, instead of presenting or reading text. Structural metadata about the characteristics of components makes that matching possible.

Voice interaction supplies answers to questions, but not all answers will be complete in a single response. Users may want to hear alternative answers, or get more detailed answers. Structural metadata can support multi-answer questions.

Schema.org metadata indicates content that answers questions using the Answer type, which is used by many forums and Q&A pages. Schema.org distinguishes between two kinds of answers. The first, acceptedAnswer, indicates the best or most popular answer, often the answer that received most votes. But other answers can be indicated with a property called suggestedAnswer. Alternative answers can be ranked according to popularity as well. When sources have multiple answers, users can get alternative perspectives on a question. After listening to the first “accepted” answer, the user might ask “tell me another opinion” and a popular “suggested” answer could be read to them.
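Sketched as JSON-LD, a forum question with an accepted answer and a suggested alternative might look like this (the question, answers, and vote counts are invented):

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Question",
    "name": "Is it worth paying points on a mortgage?",
    "acceptedAnswer": {
      "@type": "Answer",
      "upvoteCount": 112,
      "text": "Usually only if you will keep the loan long enough to recoup the upfront cost."
    },
    "suggestedAnswer": {
      "@type": "Answer",
      "upvoteCount": 47,
      "text": "Compare the break-even period with how long you expect to stay; sometimes the cash is better used elsewhere."
    }
  }
  </script>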

Another kind of multi-part answer involves “How To” instructions. The HowTo type indicates “instructions that explain how to achieve a result by performing a sequence of steps.” The example the schema.org website provides to illustrate the use of this type involves instructions on how to change a tire on a car. Imagine car changing instructions being read aloud on a smartphone or by an in-vehicle infotainment system as the driver tries to change his flat tire along a desolate roadway. This is a multi-step process, so the content needs to be retrievable in discrete chunks.

Schema.org includes several additional types related to HowTo that structure the steps into chunks, including preconditions such as tools and supplies required. These are:

  • HowToSection : “A sub-grouping of steps in the instructions for how to achieve a result (e.g. steps for making a pie crust within a pie recipe).”
  • HowToDirection : “A direction indicating a single action to do in the instructions for how to achieve a result.”
  • HowToSupply : “A supply consumed when performing the instructions for how to achieve a result.”
  • HowToTool : “A tool used (but not consumed) when performing instructions for how to achieve a result.”

These structures can help the content match the intent of users as they work through a multi-step process. The different chunks are structurally connected through the step property, as the sketch below illustrates. Only the HowTo type (and its more specialized subtype, the Recipe) currently accepts the step property and thus can address temporal sequencing.
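A condensed sketch of the tire-changing example using these types; the steps are abbreviated and the wording is illustrative.

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "HowTo",
    "name": "How to change a flat tire",
    "tool": [
      { "@type": "HowToTool", "name": "Jack" },
      { "@type": "HowToTool", "name": "Lug wrench" }
    ],
    "supply": [
      { "@type": "HowToSupply", "name": "Spare tire" }
    ],
    "step": [
      {
        "@type": "HowToSection",
        "name": "Prepare the car",
        "position": 1,
        "itemListElement": [
          { "@type": "HowToDirection", "position": 1,
            "text": "Park on a flat surface and switch on the hazard lights." },
          { "@type": "HowToDirection", "position": 2,
            "text": "Loosen the lug nuts before jacking up the car." }
        ]
      },
      {
        "@type": "HowToSection",
        "name": "Swap the tire",
        "position": 2,
        "itemListElement": [
          { "@type": "HowToDirection", "position": 1,
            "text": "Remove the flat, mount the spare, and hand-tighten the lug nuts." }
        ]
      }
    ]
  }
  </script>

A voice assistant could then read one direction at a time, advancing through the steps on request.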

Content Agility Through Structural Metadata

Chatbots, voice interaction and other forms of multimodal content promise a different experience than is offered by screen-centric GUI content. While it is important to appreciate these differences, publishers should also consider the continuities between traditional and emerging paradigms of content interaction. They should be cautious before rushing to create new content. They should start with the content they have, and see how it can be adapted before making content they don’t have.

A decade ago, the emergence of smartphones and tablets triggered an app development land rush. Publishers obsessed over the discontinuity these new devices presented, rather than recognizing their continuity with existing web browser experiences. Publishers created multiple versions of content for different platforms. Responsive web design emerged to remedy the siloing of development. The app bust shows that parallel, duplicative, incompatible development is unsustainable.

Existing content is rarely fully ready for an unpredictable future. The idealistic vision of single source, format free content collides with the reality of new requirements that are fitfully evolving. Publishers need an option between the extremes of creating many versions of content for different platforms, and hoping one version can serve all platforms. Structural metadata provides that bridge.

Publishers can use structural metadata to leverage content they have already that could be used to support additional forms of interaction. They can’t assume they will directly orchestrate the interaction with the content. Other platforms such as Google, Facebook or Amazon may deliver the content to users through their services or devices. Such platforms will expect content that is structured using standards, not custom code.

Sometimes publishers will need to enhance existing content to address the unique requirements of voice interaction, or differences in how third-party platforms expect content to be structured. The prospect of enhancing existing content is preferable to creating new content to address isolated use case scenarios. Structural metadata by itself won’t make content ready for every platform or form of interaction. But it can accelerate its readiness for such situations.

— Michael Andrews


  1. Dialogs in chatbots and voice interfaces also involve sequences of information. But how to sequence a series of cards may be easier to think about than a series of sentences, since viewing cards doesn’t necessarily involve a series of back and forth questions. ↩︎