Author: Michael Andrews

Content Velocity, Scope, and Strategy

By Michael Andrews
September 12, 2017

I want to discuss three related concepts.

First, the velocity of the content, or how quickly the essence of content changes.

Second, the scope that the content addresses, that is, whether it is meant for one person or many people, and whether it is intended to have a short or a long life span.

Lastly, how those factors affect publishing strategy, and the various capabilities publishers need.

These concepts — velocity, scope and strategy — can help publishers diagnose common problems in content operations. Many organizations produce too much content, and find they have repetitive or dated content. Others struggle to implement approaches that were developed and successfully used in one context, such as technical documentation, and apply them to another, such as marketing communication. Some publishers don’t have clear criteria addressing the utility of content, such as when new content is needed, or how long published content should be kept online. Instead of relying on hunches to deal with these issues, publishers should structure their operations to reflect the ultimate purpose of their content.

Content should have a core rationale for why it exists. Does the organization publish content to change minds — to get people to perceive topics differently, learn about new ideas, or to take an action they might not otherwise take without the content? Or does it publish content to mind the change — to keep audiences informed with the most current information, including updates and corrections, on topics they already know they need to consult?

When making content, publishers should be able to answer: In what way is the content new? Is it novel to the audience, or just an update on something they already know about? The concept of content velocity can help us understand how quickly the information associated with content changes, and the extent to which newly created content provides new information.

Content Velocity: Assessing Newness and Novelty

All content is created on the implicit assumption that it offers something new, or something better than the existing content that’s available. Unfortunately, much new content gets created without anyone questioning whether it is necessary at all. True, people need information about a topic to support a task or goal they have. But is new content really necessary? Or could existing content be revised to address these needs?

Let’s walk through the process by which new content gets created.

Is new content necessary, or should existing content be revised?

The first decision is whether the topic or idea warrants the creation of new content, or whether existing content covers much of the same material. If the topic or idea is genuinely new and has not been published previously, then new content is needed. If the publisher has only a minor update to material they’ve previously published, they should update the existing content, and not create new content. They may optionally issue an alert indicating that a change has been made, but such a notification won’t be part of the permanent record. Too often, publishers decide to write new articles about minor changes that get added to the permanent stock of content. Since the changes were minor, most such articles repeat information already published elsewhere, resulting in duplication and confusion for all concerned.

The next issue is to decide if the new content is likely to be viewed by an individual more than once. This is the shelf life of the content, considered from the audience’s perspective. Some content is disposable: its value is negligible after being viewed, or if never viewed by a certain date. Content strategists seldom discuss short-lived, disposable content, except to criticize it as intrinsically wasteful. Yet some content, owing to its nature, is short lived. Like worn razor blades or leftover milk, it won’t be valuable forever. It needs to disappear from the individual’s field of vision when it is no longer useful. If the audience considers the content disposable, then the publisher needs to treat it that way as well, and have a process for getting the content off the shelf. Other content is permanent: it always needs to be available, because people may need to consult it more than once.

Publishers must also decide whether the content is either custom (intended for a specific individual), or generic (intended for many people). We will return to custom and generic content shortly.

If the publisher already has content covering the topic, it needs to ask whether new information has emerged that requires existing content to be updated. We’d also like to know if some people may have seen this existing content previously, and will be interested in knowing what’s changed. For example, I routinely consult W3C standards drafts. I may want to know what’s different between one revision and the prior one, and appreciate when that information is called out. For content I don’t routinely consult, I am happy to simply know that all details are current as of the date when the content was last revised.

One final case exists, which is far too common. The publisher has already covered the topic or idea, and has no new information to offer. Instead, they simply repackage existing content, giving it a veneer of looking new. While repackaged content is sometimes okay if it involves a genuinely different approach to presenting the information, it is generally not advisable. Repackaged content results from the misuse of the concept of content reuse. Many marketing departments have embraced content reuse as a way to produce ever more content, saying the same thing, in the hopes that some of this content will be viewed. The misuse of content reuse, particularly the automated creation of permanent content, is fueling an ever growing content bubble. Strategic content reuse, in contrast, involves the coordination of different content elements into unique bundles of information, especially customized packages of information that support targeted needs or interests.
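The walk-through above can be read as a small decision procedure. The following Python sketch is one hypothetical way to write it down; the `Proposal` fields and the wording of the returned recommendations are illustrative assumptions, not a formal model.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """A proposed piece of content. All field names are hypothetical."""
    topic_is_new: bool                # nothing has been published on this topic yet
    has_new_information: bool         # something has changed since the last version
    readers_consult_repeatedly: bool  # returning readers will want to see what changed


def decide(p: Proposal) -> str:
    """Return a recommended action, following the questions in the text."""
    if p.topic_is_new:
        return "create new content"
    if not p.has_new_information:
        # The "far too common" case: nothing new to say about a covered topic.
        return "avoid repackaging; leave existing content as is"
    if p.readers_consult_repeatedly:
        return "update existing content and call out what changed"
    return "update existing content and show the last-revised date"
```

For example, `decide(Proposal(False, True, True))` recommends updating the existing content and calling out the changes, which matches the standards-draft example above.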

Once publishers decide to create new content, they need to decide on its scope: the content’s expected audience and expected use.

Content Scope: Assessing Uniqueness and Specificity

Content scope refers to how unique or specific newly created content is. We can consider uniqueness in terms of audiences (whether the content is for a specific individual, or a collective group), and in terms of time (whether the content is meant to be used at a specific moment only, or will be viewed again). Content that is intended for a specific individual, or for viewing at a specific time, is more unique, and has a narrower range of uses, than content that’s been created for many people, or for viewing multiple times by the same person. How and when the audience uses the content will influence how the publisher will need to create and manage that content.

Scope can vary according to four dimensions:

The expected frequency of use
The expected audience size
The archival and management approach (which will mirror the expected frequency of use)
The content production approach (which will mirror the expected audience size)

The expected frequency of use looks at whether someone is likely to want to view content again after seeing it once. This looks at relevance from an individual’s perspective, rather than a publisher’s perspective. Publishers may like to think they are creating so-called evergreen content that people will always find relevant, but from an audience perspective, most content, once viewed, will never be looked at again. When audiences encounter content they’ve previously viewed, they are likely to consider it as clutter, unless they’ve a specific reason to view it again. Audiences are most likely to consider longer, more substantive content on topics of enduring interest as permanent content. They are likely to consider most other content as disposable.

Disposable content is often event driven. Audiences need content that addresses a specific situation, and what is most relevant to them is content that addresses their needs at that specific moment. Typically this content is either time sensitive, or customized to a specific scenario. Most news has little value unless seen shortly after it is created. Customized information can deliver only the most essential details that are relevant to that moment. Customers may not want to know everything about their car lease — they only want to know about the payment for this month. Once this month’s payment question has been answered, they no longer need that information. This scenario shows how disposable content can be a subset of permanent content. Audiences may not want to view all the permanent content, and only want to view a subset of it. Alerts are one way to deliver disposable content that highlights information that is relevant, but only for a short time.

The expected audience refers to whether the content is intended for an individual, or addresses the interests of a group of individuals. Historically, nearly all online content addressed a group of people, frequently everyone. More recently, digital content has become more customized to address individual situational needs and interests, where the content one person views will not be the same as the content another views, even if the content covers the same broad topic. The content delivered can consider factors such as geolocation, viewing history, purchase history, and device settings to provide content that is more relevant to a specific individual. By extension, the more that content is adjusted to be relevant to a specific individual, the less that same content will be relevant to other individuals.

A tradeoff exists between how widely viewed content is, and how helpful it might be to a specific individual. Generic reference content may generate many views, and be helpful to many people, but it might not provide exactly what any one of those people wants. Single-use content created for an individual may provide exactly what that person needed, at the specific time they viewed the content. But that content will be helpful to a single person only, unless such customization is scalable across time and different individuals.

Disposable content is moment-rich, but duration-poor. Marketing emails highlight the essential features of disposable content. People almost never save marketing emails, and they rarely forward them to family and friends. They rarely even open and read them, unless they are checking their email at a moment of boredom and want a distraction, such as fantasizing about some purchase they may not need, or wanting to feel virtuous for reading a tip they may never actually use. Disposable content sometimes generates zero views by an individual, and will almost never generate more than one view. If there’s ever a doubt about whether someone might really need the information later, publishers can add a “save for later” feature, but only when there’s a strong reason to believe an identifiable minority has a critical need to access the content again.

Publishers face two hurdles with disposable content: being able to quickly produce new content, and being able to deliver time-sensitive or urgent content to the right person when it is needed. They don’t need to worry about archiving the content, since it is no longer valuable. Disposable content is always changing, so that different people on different days will receive different content.

With permanent content, publishers need to worry about managing existing content, and having a process for updating it. Publishers become concerned with consistency, tracking changes, and versioning. These tasks are less frenetic than those for disposable content, but they can be more difficult to execute well. It is easy to keep adding layers of new material on top of old material, while failing to indicate what’s now important, and for whom.

Content that’s used repeatedly, but is customized to specific individual needs, can present tricky information architecture challenges. These can be addressed by having a user log in to a personal account, where their specific content is stored and accessible.
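The scope dimensions discussed in this section can be summarized as a small lookup. The Python sketch below is only an illustration under stated assumptions: the enum names and the phrasing of each approach are hypothetical labels chosen to mirror the text, where management follows frequency of use and production follows audience size.

```python
from enum import Enum


class Use(Enum):
    DISPOSABLE = "viewed at most once"
    PERMANENT = "consulted repeatedly"


class Audience(Enum):
    CUSTOM = "one individual"
    GENERIC = "many people"


def operations(use: Use, audience: Audience) -> dict:
    """Map expected use and audience to the approaches they imply."""
    # The management approach mirrors the expected frequency of use.
    management = (
        "deliver at the right moment, then retire; no archiving"
        if use is Use.DISPOSABLE
        else "maintain, version, and track changes"
    )
    # The production approach mirrors the expected audience size.
    production = (
        "assemble per person from user variables"
        if audience is Audience.CUSTOM
        else "produce once for everyone"
    )
    plan = {"management": management, "production": production}
    if use is Use.PERMANENT and audience is Audience.CUSTOM:
        # Repeatedly used, individually customized content needs a home.
        plan["access"] = "store in the user's personal account"
    return plan
```

Calling `operations(Use.PERMANENT, Audience.CUSTOM)` adds the personal-account entry, while the other three combinations return only the management and production approaches.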

Strategies for Fast and Slow Content: Operations Fit to Purpose

All publishers operate somewhere along a spectrum. One end emphasizes quick turn-around, short-lived content (such as news organizations), and the other end emphasizes slowly evolving, long-lived content (such as healthcare advice for people with chronic conditions). Many organizations will publish a mix of fast and slow content. But it’s important for organizations to understand whether they are primarily a fast or slow content publisher, so that they can decide on the best strategy to support their publishing goals.

Most organizations will be biased toward either fast or slow content. Fast moving consumer goods, unsurprisingly, tend to create fast content. In contrast, heavy equipment manufacturers, whose products may last for decades, tend to generate slow content that’s revised and used over a long period.

Different roles in organizations gravitate toward either fast or slow content. Consider a software company. Marketers will blitz customers with new articles talking about how revolutionary the latest release of their software is. Customer support may be focused on revising existing content about the product, and reassuring customers that the changes aren’t frightening, but easy to learn and not disruptive. Even if the new release generates a new instance of technical documentation devoted to that release, the documentation will reuse much of the content from previous releases, and will essentially be a revision to existing content, rather than fundamentally new content.

Fast content is different from slow content

Some marketers want their copywriters to become more like journalists, and have set up “newsrooms” to churn out new content. When emulating journalists, marketers are sticking with the familiar fast content paradigm, where content is meant to be read once only, preferably soon after it’s been created. Most news gets old quickly, unless it is long form journalism that addresses long-term developments. Marketing content frequently has the lifespan of a mosquito.

Marketing content tends to focus on:

Creating new content, or
Repackaging existing content, and
Making stuff sound new (and therefore attention worthy)

For fast content, production agility is essential.

Non-marketing content has a different profile. Non-marketing content includes advisory information from government or health organizations, and product content, such as technical documentation, product training, online support content, and other forms of UX content such as on-screen instructions. Such content is created for the long term, and tends to emphasize that it is solid, reliable and up-to-date. Rather than creating lots of new content, existing content tends to evolve. It gets updated, and expands as products gain features or knowledge grows. It may lead with what’s new, but will build on what’s been created already.

Much non-marketing content is permanent content about a fixed set of topics. The key task is not brainstorming new topics to talk about, but keeping published information up-to-date. New permanent topics are rare. When new topics are necessary, it’s common for new topics to emerge as branches of an existing topic.

Fast and slow content are fundamentally different in orientation. Organizations are experimenting with ways to bridge these differences. Organizations may try to make their marketing content more like product content, or conversely, make their product content more like marketing content.

Some marketing organizations are adopting technical communications methods, for example, content management practices developed for technical documentation such as DITA. Marketing communications are seeking to leverage lessons from slow content practices, and apply them to fast content, so that they can produce more content at a larger scale.

Marketers want their content to become more targeted. They want to componentize content so they can reuse content elements in endless combinations. They embrace reuse, not as a path to revise existing content, but as a mechanism to push out new content quickly, using automation. At its best, such automation can address the interests of audiences more precisely. At its worst, content automation becomes a fatigue-inducing, attention-fragmenting experience for audiences, who are constantly goaded to view messages without ever developing an understanding. Content reuse is a poor strategy for getting attention from audiences. New content, when generated from the reuse of existing content components, never really expresses new ideas. It just recombines existing information.

Some technical communicators, who develop slow content, are implementing practices associated with marketing communications. Rather than only producing permanent documents to read, technical communication teams are seeking to push specific disposable messages to resolve issues. Technical communication teams are embracing more push tactics, such as developing editorial calendars, to highlight topics to send to audiences, instead of waiting for audiences to come to them. These teams are seeking to become more agile, and targeted, in the content they produce.

As the boundaries between the practices of fast and slow content begin to overlap, delivery becomes more important. Publishers need to choose between targeted and non-targeted delivery. They must decide if their content will be customized and dynamically created according to user variables, or pre-made to anticipate user needs.

The value of fast content depends above all on the accuracy of its targeting. There is no point creating disposable content if it doesn’t resolve a problem for users. If publishers rely on fast content, but can’t deliver it to the right users at the right time, the user may never find out the answer to their question, especially if permanent content gets neglected in the push for instant content delivery.

Generic fast content is becoming ever more difficult to manage. Individuals don’t want to see content they’ve viewed already, or decided they weren’t interested in viewing to begin with. But because generic content is meant for everyone, it is difficult to know who has seen or not seen content items. Fast generic content still has a role. Targeting has its limits. Publishers are far from being able to produce personalized content for everyone that is useful and efficient. Much content will inevitably have no repeat use. Yet fast generic content can easily become a liability that is difficult to manage. Recommendation engines based on user viewing behaviors and known preferences can help prioritize this content so that more relevant content surfaces. But publishers should be judicious when creating fast generic content, and should enforce strict rules on how long such content stays available online.
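A rule on how long fast generic content stays online can be as simple as a retention window. The Python sketch below assumes a hypothetical 30-day window and a hypothetical `Item` record; the text does not prescribe either, and a real publisher would choose its own rule.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed retention window for fast generic content (not from the text).
RETENTION = timedelta(days=30)


@dataclass
class Item:
    title: str
    published: date
    fast_generic: bool  # True for short-lived content meant for everyone


def still_available(items: list[Item], today: date) -> list[Item]:
    """Keep permanent content, plus fast generic content inside the window."""
    return [
        item
        for item in items
        if not item.fast_generic or today - item.published <= RETENTION
    ]
```

Permanent content passes through untouched; only fast generic items older than the window are dropped from what stays available.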

Automation is making new content easier to create, which is increasing the temptation to create more new content. Unfortunately, digital content can resemble plastic shopping bags, which are useful when first needed, but which generally never get used again, becoming waste. Publishers need to consider content reuse not just from their own parochial perspective, but from the perspective of their audiences. Do their audiences want to view their content more than once? Marketing content is the source of most fast content. Most marketing content is never read more than once. Can that ever change? Are marketers capable of producing content that has long term value to their audiences? Or will they insist on controlling the conversation, directing their customers on what content to view, and when to view it?

Creating new content is not always the right approach. Automation can make it more convenient for publishers to pursue the wrong strategy, without scrutinizing the value of such content to the organization, and its customers. Content production agility is valuable, but having robust content management is an even more strategic capability.

— Michael Andrews

Tags scope, velocity

Content Experience

Should Information be Data-Rich or Content-Rich?

By Michael Andrews
August 14, 2017

One of the most challenging issues in online publishing is how to strike the right balance between content and data. Publishers of online information, as a matter of habit, tend to favor either a content-centric, or a data-centric approach. Publishers may hold deep-seated beliefs about what form of information is most valuable. Some believe that compelling stories will wow audiences. Others expect that new artificial intelligence agents, providing instantaneous answers, will delight them. This emphasis on information delivery can overshadow consideration of what audiences really need to know and do. How information is delivered can get in the way of what the audience needs. Instead of delight, audiences experience apathy and frustration. The information fails to deliver the right balance between facts and explanation.

The Cultural Divide

Information on the web can take different forms. Perhaps the most fundamental difference is whether online information provides a data-rich or content-rich experience. Each form of experience has its champions, who promote the virtues of data (or content). Some go further, and dismiss the value of the approach they don’t favor, arguing that content (or data) actually gets in the way of what users want to know.

A (arguing for data-richness): Customers don’t want to read all that text! They just want the facts.
B (arguing for content-richness): Just showing facts and figures will lull customers to sleep!

Which is more important, offering content or data? Do users want explanations and interpretations, or do they just want the cold hard facts? Perhaps it depends on the situation, you think. Think of a situation where people need information. Do they want to read an explanation and get advice, or do they want a quick unambiguous answer that doesn’t involve reading (or listening to a talking head)? The scenario you have in mind, and how you imagine people’s needs in that scenario, probably reveals something about your own preferences and values. Do you like to compare data when making decisions, or do you like to consider commentary? Do your own PowerPoint slides show words and images, or do they show numbers and graphs? Did you study a content-centric discipline such as the humanities in university, or did you study a data-centric one such as commerce or engineering? What are your own definitions of what’s helpful or boring?

Our attitudes toward content and data reflect how we value different forms of information. Some people favor more generalized and interpreted information, and others prefer specific and concrete information. Different people structure information in different ways, through stories for example, or by using clearly defined criteria to evaluate and categorize information. These differences may exist within your target audience, just as they may show up within the web team trying to deliver the right information to that audience. People vary in their preferences. Individuals may shift their personal preferences depending on topic or situation. What form of information audiences will find most helpful can elude simple explanations.

Content and data have an awkward relationship. Each seems to involve a distinct mode of understanding. Each can seem to interrupt the message of the other. When relying on a single mode of information, publishers risk either over-communicating, or under-communicating.

Content and Data in Silhouette

To keep things simple (and avoid conceptual hairsplitting), let’s think about data as any values that are described with an attribute. We can consider data as facts about something. Data can be any kind of fact about a thing; it doesn’t need to be a number. Whether text or numeric, data are values that can be counted.
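As a minimal illustration of this working definition, the Python sketch below treats each record as a set of attribute-value pairs and counts the values of one attribute. The shop records are invented for the example.

```python
from collections import Counter

# Hypothetical records: each fact is a value described by an attribute.
shops = [
    {"name": "Shop A", "city": "Leeds", "closes": "18:00"},
    {"name": "Shop B", "city": "Leeds", "closes": "20:00"},
    {"name": "Shop C", "city": "York", "closes": "18:00"},
]


def count_by(records: list[dict], attribute: str) -> Counter:
    """Count how often each value of the given attribute occurs."""
    return Counter(record[attribute] for record in records)
```

Both the text value `city` and the time value `closes` can be counted in the same way, which is the sense in which data need not be numeric.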

Content can involve many distinct types, but for simplicity, we’ll consider content as articles and videos — containers where words and images combine to express ideas, stories, instructions, and arguments.

Both data and content can inform. Content has the power to persuade, and sometimes data can possess that power as well. So what is the essential difference between them? Each has distinct limitations.

The Limits of Content

In certain situations content can get in the way of solving user problems. Many times people are in a hurry, and want to get a fact as quickly as possible. Presenting data directly to audiences doesn’t always mean people get their question answered instantly, of course. Some databases are lousy at answering questions for ordinary people who don’t use databases often. But a growing range of applications now provide “instant answers” to user queries by relying on data and computational power. Whereas content is a linear experience, requiring time to read, view or listen, data promises an instant experience that can gratify immediately. After all, who wants to waste their customer’s time? Content strategy has long advocated solving audience problems as quickly as possible. Can data obviate the need for linear content?

“When you think about something and don’t really know much about it, you will automatically get information. Eventually you’ll have an implant where if you think about a fact, it will just tell you the answer.” Google’s Larry Page, in Steven Levy’s “In the Plex”.

The argument that users don’t need websites (and their content) is advanced by SEO expert Aaron Bradley in his article “Zero Blue Links: Search After Traffic”. Aaron asks us to “imagine a world in which there was still an internet, but no websites. A world in which you could still look for and find information, but not by clicking from web page to web page in a browser.”

Aaron notes that within Google search results, increasingly it is “data that’s being provided, rather than a document summary.” Audiences can see a list of product specs, rather than a few sentences that discuss those specs. He sees this as the future of how audiences will access information on different devices. “Users of search engines will increasingly be the owners of smart phones and smart watches and smart automobiles and smart TVs, and will come to expect seamless, connected, data-rich internet experiences that have nothing whatsoever to do with making website visits.”

In Aaron’s view, we are seeing a movement from “documents to data” on the web. “The evolution of search results in terms of the gradual supplanting of document references by data than it is to infer that direction through the enumeration of individual features.” No need to read a document: search results will answer the question. It’s an appealing notion, and one that is becoming more commonplace. Content isn’t always necessary if clear, unambiguous data is available that can answer the question.

Google, or any search engine, is just a channel — an important one for sure, but not the end-all and be-all. Search engines locate information created by others, but unless they have rights to that information, they are limited in what they do with it. Yet the principles here can apply to other kinds of interactive apps, channels and platforms that let users get information instantly, without wading through articles or videos. So is content now obsolete?

There is an important limitation to considering SEO search results as data. Even though the SEO community refers to search metadata as “structured data”, the use of this term is highly misleading. The values described by the metadata aren’t true data that can be counted. They are values to display, or are links to other values. The problem with structured data as currently practiced is that it doesn’t enforce how the values need to be described. The structured data values are never validated, so computers can’t be sure if two prices appearing on two random websites are both quoting the same currency, even if both mention dollars. SEO structured data rarely requires controlled vocabulary for text values, and most of its values don’t include or mandate the data typing that computers would need to aggregate and compare different values. Publishers are free to use almost any kind of text value they like in many situations. The reality of SEO structured data is less glamorous than its image: much of the information described by SEO structured data is display content for humans to read, rather than data for machines to transform. The customers who scan Google’s search results are people, not machines. People still need to evaluate the information, and decide its credibility and relevance. The values aren’t precise and reliable enough for computers to make such judgements.
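The point about prices and currencies can be made concrete. The Python sketch below is a hypothetical illustration, not a description of how any search engine or markup vocabulary works: it contrasts free-text price strings with a typed `Price` value whose currency is an explicit code, which is the kind of typing a machine would need before comparing two values.

```python
from dataclasses import dataclass
from decimal import Decimal

# Free-text display values: both mention dollars, but nothing says which
# dollars, so a program cannot safely compare them.
site_a_text = "$19.99"
site_b_text = "19.99 dollars"


@dataclass(frozen=True)
class Price:
    """A typed price. The class and field names are assumptions."""
    amount: Decimal
    currency: str  # an explicit currency code such as "USD" or "CAD"


def cheaper(a: Price, b: Price) -> Price:
    """Return the lower price, refusing to compare across currencies."""
    if a.currency != b.currency:
        raise ValueError("cannot compare prices in different currencies")
    return a if a.amount <= b.amount else b
```

With typed values, a comparison either succeeds or fails loudly. With the two text strings, a person has to read them and guess.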

When an individual wants to know what time a shop closes, it’s a no-brainer to provide exactly that information, and no more. The strongest case for presenting data directly is when the user already knows exactly what they want to know, and will understand the meaning and significance of the data shown. These are the “known unknowns” (or “knowns but forgotten”) use cases. Plenty of such cases exist. But while the lure of instant gratification is strong, people aren’t always in a rush to get answers, and in many cases they shouldn’t be in a rush, because the question is bigger than a single answer can address.

The Limits of Data

Data in various circumstances can get in the way of what interests audiences. At a time when the corporate world increasingly extols the virtues of data, it’s important to recognize when data can be useless, because it doesn’t answer questions that audiences have. Publishers should identify when data is oversold as always being what audiences want. Unless data reflects audiences’ priorities, the data is junk as far as audiences are concerned.

Data can bring credibility to content, though it has the potential to confuse and mislead as well. Audiences can be blinded by data when it is hard to comprehend, or is too voluminous. Audiences need to be interested in the data for it to provide them with value. Much of the initial enthusiasm for data journalism, the idea of writing stories based on the detailed analysis of facts and statistics, has receded. Some stories have been of high quality, but many weren’t intrinsically interesting to large numbers of viewers. Audiences didn’t necessarily see themselves in the minutiae, or feel compelled to interact with the raw material being offered to them. Data journalism stories are different from commercially oriented information, which has well defined use cases specifying how people will interact with data. Data journalism can presume people will be interested in topics simply because public data on these topics is available. However, this data may have been collected for a different purpose, often for technical specialists. Presenting it doesn’t transform it into something interesting to audiences.

The experience of data journalism shows that not all data is intrinsically interesting or useful to audiences. But some technologists believe that making endless volumes of data available is intrinsically worthwhile, because machines have the power to unlock value from the data that can’t be anticipated.

The notion that “data is God” has fueled the development of the semantic web approach, which has subsequently been rebranded as “linked data”. The semantic web has promised many things, including giving audiences direct access to information without the extraneous baggage of content. It even promised to make audiences irrelevant in many cases, by handing over data to machines to act on, so that audiences don’t even need to view that data. In its extreme articulation, the semantic web/linked data vision considers content as irrelevant, and even audiences as irrelevant.

These ideas, while still alive and championed by their supporters, have largely failed to live up to expectations. There are many reasons for this failure, but a key one has been that proponents of linked data have failed to articulate its value to publishers and audiences. The goal of linked data always seems to be to feed more data to the machine. Linked data discussions get trapped in the mechanics of what’s best for machines (dereferenceable URIs, machine values that have no intrinsic meaning to humans), instead of what’s useful for people.

The emergence of schema.org (the structured data standard used in SEO) represents a step back from such machine-centric thinking, to accommodate at least some of the needs of human metadata creators by allowing text values. But schema.org still doesn’t offer much in the way of controlled vocabularies for values, which would be both machine-reliable and human-friendly. It only offers a narrow list of specialized “enumerations”, some of which are not easy-to-read text values.

Schema.org has lots of potential, but its current capabilities get over-hyped by some in the SEO community. Just as schema.org metadata should not be considered structured data, it is not really the semantic web either. It’s unable to make inferences, which was a key promise of the semantic web. Its limitations show why content remains important. Google’s answer to the problem of how to make structured data relevant to people was the rich snippet. Rich snippets displayed in Google search results are essentially a vanity statement. Sometimes these snippets answer the question, but other times they simply tease the user with related information. Publishers and audiences alike may enjoy seeing an extract of content in search results, and certainly rich snippets are a positive development in search. But displaying extracts of information does not represent an achievement of the power of data. A list of answers supplied by rich snippets is far less definitive than a list of answers supplied by a conventional structured query database — an approach that has been around for over three decades.

The value of data comes from its capacity to aggregate, manipulate and compare information relating to many items. Data can be impactful when arranged and processed in ways that change an audience’s perception and understanding of a topic. Genuine data provides values that can be counted and transformed, something that schema.org doesn’t support very robustly, as previously mentioned. Google’s snippets, when parsing metadata values from articles, simply display fragments from individual items of content. A list of snippets doesn’t really federate information from multiple sources into a unified, consolidated answer. If you ask Google what store sells the cheapest milk in your city, Google can’t directly answer that question, because that information is not available as data that can be compared. Information retrieval (locating information) is not the same as data processing (consolidating information).
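The distinction between retrieval and processing can be shown with a minimal sketch. It assumes, hypothetically, that stores published comparable price records; the store names, prices, and the function names `retrieve` and `cheapest` are all invented for illustration and do not describe how any search engine works.

```python
# Hypothetical price records, as if several stores had published comparable data.
# Store names and prices are invented for illustration only.
offers = [
    {"store": "Store A", "product": "milk 1L", "price": 1.29},
    {"store": "Store B", "product": "milk 1L", "price": 0.99},
    {"store": "Store C", "product": "milk 1L", "price": 1.15},
    {"store": "Store A", "product": "bread", "price": 2.50},
]


def retrieve(records, term):
    """Information retrieval: return every matching item, unconsolidated."""
    return [r for r in records if term in r["product"]]


def cheapest(records, term):
    """Data processing: compare values across items to give one answer."""
    matches = retrieve(records, term)
    if not matches:
        return None
    return min(matches, key=lambda r: r["price"])


snippets = retrieve(offers, "milk")   # a list of fragments, like a page of snippets
answer = cheapest(offers, "milk")     # a single consolidated answer

print(len(snippets), "matching items")
print(answer["store"], answer["price"])
```

Retrieval hands back three separate items and leaves the comparison to the reader. The consolidated answer is only possible because every record exposes a price as a comparable value.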

“What is the point of all that data? A large data set is a product like any other. It must be maintained and updated, given attention. What are we to make of it?” Paul Ford in “Usable Data”

But let’s assume that we do have solid data that machines can process without difficulty. Can that data provide audiences with what they need? Is content unnecessary when the data is machine quality? Some evidence suggests that even the highest quality linked data isn’t sufficient to interest audiences.

The museum sector has been interested in linked data for many years. Unlike most web publishers, they haven’t been guided by schema.org and Google. They’ve been developing their own metadata standards. Yet this project has had its problems. The data lead of a well-known art museum complained recently of the “fetishization of Linked Open Data (LOD)”. Many museums approached data as something intrinsically valuable, without thinking through who would use the data, and why. Museums reasoned that they have lots of great content (their collections) and that they needed to provide information about their collections online to everyone, and concluded that linked data was the way to do that. But the author notes: ‘“I can’t wait to see what people do with our data” is not a clear ROI.’ When data is considered as the goal, instead of as a means to a goal, then audiences get left out of the picture. This situation is common to many linked data projects, where getting data into a linked data structure becomes an all-consuming end, without anchoring the project in audience and business needs. For linked data to be useful, it needs to address specific use cases for people relying on the data.

Much magical thinking about linked data involves two assumptions: that the data will answer burning questions audiences have, and these answers will be sufficient to make explanatory content unnecessary. When combined, these assumptions become one: everything you could possibly want to know is now available as a knowledge graph.

The promise that data can answer any question is animating development of knowledge graphs and “intelligent assistants” by nearly every big tech company: Google, Bing, LinkedIn, Apple, Facebook, etc. This latest wave of data enthusiasm again raises questions about whether content is becoming less relevant.

Knowledge graphs are a special form of linked data. Instead of the data living in many places, hosted by many different publishers, the data is consolidated into a single source curated by one firm, for example, Bing. A knowledge graph combines millions of facts about all kinds of things into a single data set. A knowledge graph creator generally relies on other publishers’ linked data. But it assumes responsibility for validating that data itself when incorporating the information in its knowledge graph. In principle, the information is more reliable, both factually and technically.

Knowledge graphs work best for persistent data (the birth year of a celebrity) but less well for high velocity data that can change frequently (the humidity right now). Knowledge graphs can be incredibly powerful. They can allow people to find connections between pieces of data that might not seem related, but are. Sometimes these connections are simply fun trivia (two famous people born in the same hospital on the same day). Other times these connections are significant as actionable information. Because knowledge graphs hold so much potential, it is often difficult to know how they can be used effectively. Many knowledge graph use cases relate to open-ended exploration, instead of specific tasks that solve well-defined user problems. Few people can offer a succinct, universally relevant reply to the question: “What problem does a knowledge graph solve?” Most of the success I’ve seen for knowledge graphs has been in specialized vertical applications aimed at researchers, such as biomedical research or financial fraud investigations. To be useful to general audiences, knowledge graphs require editorial decisions that queue up on-topic questions, and return information relevant to audience needs and interests. Knowledge graphs are less useful when they simply provide a dump of information that’s related to a topic.
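To make the idea of finding connections concrete, here is a small sketch of the "same hospital, same day" example. It assumes facts are stored as simple (subject, predicate, object) triples; the people, hospital, dates, predicate names, and the function `shared_birth` are all hypothetical, and real knowledge graphs are far larger and use richer data models.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical facts expressed as (subject, predicate, object) triples.
triples = [
    ("Person A", "born_in", "Hospital X"),
    ("Person A", "born_on", "1970-01-01"),
    ("Person B", "born_in", "Hospital X"),
    ("Person B", "born_on", "1970-01-01"),
    ("Person C", "born_in", "Hospital X"),
    ("Person C", "born_on", "1982-05-17"),
]


def shared_birth(triples):
    """Return pairs of people who share both a birth hospital and a birth date."""
    # Collect each subject's properties into one record.
    facts = defaultdict(dict)
    for subject, predicate, obj in triples:
        facts[subject][predicate] = obj

    # Group people by the (hospital, date) combination.
    groups = defaultdict(list)
    for person, props in facts.items():
        if "born_in" in props and "born_on" in props:
            groups[(props["born_in"], props["born_on"])].append(person)

    # Every pair within a group is a connection.
    pairs = []
    for people in groups.values():
        pairs.extend(combinations(sorted(people), 2))
    return pairs


connections = shared_birth(triples)
print(connections)
```

The query itself is simple. The harder part, as the paragraph above notes, is the editorial decision about which connections are worth surfacing to an audience at all.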

Knowledge graphs combine aspects of Wikipedia (the crowdsourcing of data) with aspects of a proprietary gatekeeping platform such as Facebook (the centralized control of access to and prioritization of information). No one party can be expected to develop all the data needed in a knowledge graph, yet one party needs to own the graph to make it work consistently — something that doesn’t always happen with linked data. The host of the knowledge graph enjoys a privileged position: others must supply data, but have no guarantee of what they receive in return.

Under this arrangement, suppliers of data to a knowledge graph can’t calculate their ROI. Publishers are back in the situation where they must take a leap of faith that they’ll benefit from their effort. Publishers are asked to supply data to a service on the basis of a vague promise that the service will provide their customers with helpful answers. Exactly how the service will use the data is often not transparent. Knowledge graphs don’t reveal what data gets used, and when. Publishers also know their rivals are supplying data to the same graph. The faith-based approach to developing data, in hopes that it will be used, has a poor track record.

The context of data retrieved from a knowledge graph may not be clear. Google, Siri, Cortana, or Alexa may provide an answer. But on what basis do they make that judgment? The need for context to understand the meaning of data leads us back to content. What a fact means may not be self-evident. Even facts that seem straightforward can depend on qualified definitions.

“A dataset precise enough for one purpose may not be sufficiently precise for another. Data on the Web may be wrong, or wrong in some context—with or without intent.” Bernstein, Hendler & Noy

The interaction between content and data is becoming even more consequential as the tech industry promotes services incorporating artificial intelligence. In his book Free Speech, Timothy Garton Ash shared his experience using WolframAlpha, a semantic AI platform that competes with IBM Watson, and that boldly claims to make the “world’s knowledge computable.” When Garton Ash asked WolframAlpha “How free should speech be?”, it replied: “WolframAlpha doesn’t understand your query.” This kind of result is entirely expected, but it is worth exploring why something billed as being smart fails to understand. Conversational interfaces, after all, are promising to answer our questions. Data needs to exist for questions to get answers. For data to operate independently of content, an answer must be expressible as data. But many answers can’t be reduced to one or two values. Sometimes they involve many values. Sometimes answers can’t be expressed as a data value at all. This reality means that content will always be necessary for some answers.

Data as a Bridge to Content

Data and content have different temperaments. The role of content is often to lead the audience to reveal what’s interesting. The role of data is frequently to follow the audience as they indicate their interests. Content and data play complementary roles. Each can be incomplete without the other.

Content, whether articles, video or audio, is typically linear. Content is meant to be consumed in a prescribed order. Stories have beginnings and ends, and procedures normally have fixed sequences of steps. Hyperlinking content provides a partial solution to making a content experience less linear, when that is desired. Linear experiences can be helpful when audiences need orientation, but they are constraining when such orientation isn’t necessary.

Data, to be seen, must first be selected. Publishers must select what data to highlight, or they must delegate that task to the audience. Data is non-linear: it can be approached in any order. It can be highly interactive, providing audiences with the ability to navigate and explore the information in any order, and change the focus of the information. With that freedom comes the possibility that audiences get lost, unable to identify information of value. What data means is highly dependent on the audience’s previous understanding. Data can be explained with other data, but even these explanations require prior knowledge.

From an audience perspective, data plays various roles. Sometimes data is an answer, and the end of a task. Sometimes data is the start of a larger activity. Data is sometimes a signal that a topic should be looked at more closely. Few people decide to see a movie based on an average rating alone. A high rating might prompt someone to read about the film. Or the person may already be interested in reading about the film, and consults the average rating simply to confirm their own expectation of whether they’d like it. Data can be an entryway into a topic, and a point of comparison for audiences.

Writers can undervalue data because they want to start with the story they wish to tell, rather than the question or fact that prompts initial interest from the audience. Audiences often begin exploration by seeking out a fact. But what that fact may be will be different according to each individual. Content needs facts to be discovered.

Data evangelists can undervalue content because they focus on the simple use cases, and ignore the messier ones. Data can answer questions only in some situations. In an ideal world, a list of questions and answers gets paired together as data. Just match the right data with the right question. But audiences may find it difficult to articulate the right question, or they may not know what question to ask. Audiences may find they need to ask a great many specific questions to develop a broad understanding. They may find the process of asking questions exhausting. Search engines and intelligent agents aren’t going to Socratically enlighten us about new or unfamiliar topics. Content is needed.

Ultimately, whether data or content is most important depends on how much communication is needed to support the audience. Data supplies answers, but doesn’t communicate ideas. Content communicates ideas, but can fail to answer if it lacks specific details (data) that audiences expect.

No bold line divides data from content. Even basic information, such as expressing how to do something, can be approached either episodically as content, or atomically as data. Publishers can present the minimal facts necessary to perform a task (the must do’s), or they can provide a story about possibilities of tasks to do (the may do’s). How should they make that decision?

In my experience, publishers rarely create two radically alternative versions of online information, a data-centric and content-centric version, and test these against each other to see which better meets audience needs. Such an approach could help publishers understand what the balance between content and data needs to be. It could help them understand how much communication is required, so the information they provide is never in the way of the audience’s goals.

— Michael Andrews

Tags data journalism, linked_data