Categories
Content Experience

Should Information be Data-Rich or Content-Rich?

One of the most challenging issues in online publishing is how to strike the right balance between content and data. Publishers of online information, as a matter of habit, tend to favor either a content-centric or a data-centric approach. Publishers may hold deep-seated beliefs about what form of information is most valuable. Some believe that compelling stories will wow audiences. Others expect that new artificial intelligence agents, providing instantaneous answers, will delight them. This emphasis on information delivery can overshadow consideration of what audiences really need to know and do. How information is delivered can get in the way of what the audience needs. Instead of delight, audiences experience apathy and frustration. The information fails to deliver the right balance between facts and explanation.

The Cultural Divide

Information on the web can take different forms. Perhaps the most fundamental difference is whether online information provides a data-rich or content-rich experience. Each form of experience has its champions, who promote the virtues of data (or content).  Some go further, and dismiss the value of the approach they don’t favor, arguing that content (or data) actually gets in the way of what users want to know.

  • A (arguing for data-richness): Customers don’t want to read all that text!  They just want the facts.  
  • B (arguing for content-richness): Just showing facts and figures will lull customers to sleep!

Which is more important, offering content or data?  Do users want explanations and interpretations, or do they just want the cold hard facts?  Perhaps it depends on the situation, you think.  Think of a situation where people need information.  Do they want to read an explanation and get advice, or do they want a quick unambiguous answer that doesn’t involve reading (or listening to a talking head)?  The scenario you have in mind, and how you imagine people’s needs in that scenario, probably reveals something about your own preferences and values.  Do you like to compare data when making decisions, or do you like to consider commentary?  Do your own PowerPoint slides show words and images, or do they show numbers and graphs? Did you study a content-centric discipline such as the humanities in university, or did you study a data-centric one such as commerce or engineering? What are your own definitions of what’s helpful or boring?

Our attitudes toward content and data reflect how we value different forms of information.  Some people favor more generalized and interpreted information, and others prefer specific and concrete information.  Different people structure information in different ways, through stories for example, or by using clearly defined criteria to evaluate and categorize information.  These differences may exist within your target audience, just as they may show up within the web team trying to deliver the right information to that audience.  People vary in their preferences. Individuals may shift their personal  preferences depending on topic or situation.  What form of information audiences will find most helpful can elude simple explanations.

Content and data have an awkward relationship. Each seems to involve a distinct mode of understanding.  Each can seem to interrupt the message of the other. When relying on a single mode of information, publishers risk either over-communicating, or under-communicating.

Content and Data in Silhouette

To keep things simple (and avoid conceptual hairsplitting), let’s think about data as any values that are described with an attribute.  We can consider data as facts about something.  Data can be any kind of fact about a thing; it doesn’t need to be a number. Whether text or numeric, data are values that can be counted.

Content can involve many distinct types, but for simplicity, we’ll consider content as articles and videos — containers  where words and images combine to express ideas, stories, instructions, and arguments.

Both data and content can inform. Content has the power to persuade, and sometimes data possesses that power as well. So what is the essential difference between them? Each has distinct limitations.

The Limits of Content

In certain situations content can get in the way of solving user problems. Many times people are in a hurry, and want to get a fact as quickly as possible. Presenting data directly to audiences doesn’t always mean people get their questions answered instantly, of course. Some databases are lousy at answering questions for ordinary people who don’t use databases often. But a growing range of applications now provide “instant answers” to user queries by relying on data and computational power. Whereas content is a linear experience, requiring time to read, view or listen, data promises an instant experience that can gratify immediately. After all, who wants to waste their customer’s time? Content strategy has long advocated solving audience problems as quickly as possible. Can data obviate the need for linear content?

“When you think about something and don’t really know much about it, you will automatically get information.  Eventually you’ll have an implant where if you think about a fact, it will just tell you the answer.”  Google’s Larry Page, in Steven Levy’s  “In the Plex”.

The argument that users don’t need websites (and their content) is advanced by SEO expert Aaron Bradley in his article “Zero Blue Links: Search After Traffic”.   Aaron asks us to “imagine a world in which there was still an internet, but no websites. A world in which you could still look for and find information, but not by clicking from web page to web page in a browser.”

Aaron notes that within Google search results, increasingly it is “data that’s being provided, rather than a document summary.”  Audiences can see a list of product specs, rather than a few sentences that discuss those specs. He sees this as the future of how audiences will access information on different devices.  “Users of search engines will increasingly be the owners of smart phones and smart watches and smart automobiles and smart TVs, and will come to expect seamless, connected, data-rich internet experiences that have nothing whatsoever to do with making website visits.”

In Aaron’s view, we are seeing a movement from “documents to data” on the web. He describes the evolution of search results in terms of the gradual supplanting of document references by data. No need to read a document: search results will answer the question. It’s an appealing notion, and one that is becoming more commonplace. Content isn’t always necessary if clear, unambiguous data is available that can answer the question.

Google, or any search engine, is just a channel — an important one for sure, but not the be-all and end-all. Search engines locate information created by others, but unless they have rights to that information, they are limited in what they can do with it. Yet the principles here can apply to other kinds of interactive apps, channels and platforms that let users get information instantly, without wading through articles or videos. So is content now obsolete?

There is an important limitation to considering SEO search results as data. Even though the SEO community refers to search metadata as “structured data”, the use of this term is highly misleading. The values described by the metadata aren’t true data that can be counted. They are values to display, or links to other values. The problem with structured data as currently practiced is that it doesn’t enforce how the values need to be described. The structured data values are never validated, so computers can’t be sure whether two prices appearing on two random websites are quoting the same currency, even if both mention dollars. SEO structured data rarely requires controlled vocabularies for text values, and most of its values don’t include or mandate the data typing that computers would need to aggregate and compare different values. Publishers are free to use almost any kind of text value they like in many situations. The reality of SEO structured data is less glamorous than its image: much of the information described by SEO structured data is display content for humans to read, rather than data for machines to transform. The customers who scan Google’s search results are people, not machines. People still need to evaluate the information, and decide its credibility and relevance. The values aren’t precise and reliable enough for computers to make such judgments.
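
To make the limitation concrete, here is a minimal sketch (the retailers, product, and values are invented; the property names loosely follow schema.org) of how two publishers can mark up the “same” facts in ways a machine cannot reliably reconcile:

```python
# Two hypothetical retailers describing the same product with schema.org-style
# properties. Both blobs are syntactically valid "structured data", yet the
# values are free text that a machine cannot safely compare or aggregate.
import json

listing_a = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Espresso Maker X100",
    "color": "brushed silver",
    "offers": {"@type": "Offer", "price": "about 120 dollars"},  # free-text price
}

listing_b = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Espresso Maker X100",
    "color": "Silver",
    "offers": {"@type": "Offer", "price": "120", "priceCurrency": "AUD"},
}

# Nothing guarantees the two prices quote the same currency, or that
# "brushed silver" and "Silver" are the same controlled value. A person can
# reconcile this at a glance; a program cannot.
print(json.dumps([listing_a, listing_b], indent=2))
```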

When an individual wants to know what time a shop closes, it’s a no-brainer to provide exactly that information, and no more. The strongest cases for presenting data directly are when the user already knows exactly what they want to know, and will understand the meaning and significance of the data shown. These are the “known unknowns” (or “knowns but forgotten”) use cases. Plenty of such cases exist. But while the lure of instant gratification is strong, people aren’t always in a rush to get answers, and in many cases they shouldn’t be in a rush, because the question is bigger than a single answer can address.

The Limits of Data

Data in various circumstances can get in the way of what interests audiences. At a time when the corporate world increasingly extols the virtues of data, it’s important to recognize when data can be useless, because it doesn’t answer questions that audiences have. Publishers should recognize when data is oversold as always being what audiences want. Unless data reflects audiences’ priorities, the data is junk as far as audiences are concerned.

Data can bring credibility to content, though it has the potential to confuse and mislead as well. Audiences can be blinded by data when it is hard to comprehend, or is too voluminous. Audiences need to be interested in the data for it to provide them with value. Much of the initial enthusiasm for data journalism, the idea of writing stories based on the detailed analysis of facts and statistics, has receded. Some stories have been of high quality, but many weren’t intrinsically interesting to large numbers of viewers. Audiences didn’t necessarily see themselves in the minutiae, or feel compelled to interact with the raw material being offered to them. Data journalism stories are different from commercially oriented information, which has well-defined use cases specifying how people will interact with data. Data journalism can presume people will be interested in topics simply because public data on these topics is available. However, this data may have been collected for a different purpose, often for technical specialists. Presenting it doesn’t transform it into something interesting to audiences.

The experience of data journalism shows that not all data is intrinsically interesting or useful to audiences.  But some technologists believe that making endless volumes of data available is intrinsically worthwhile, because machines have the power to unlock value from the data that can’t be anticipated.

The notion that “data is God” has fueled the development of the semantic web approach, which has subsequently been  rebranded as “linked data”.  The semantic web has promised many things, including giving audiences direct access to information without the extraneous baggage of content.  It even promised to make audiences irrelevant in many cases, by handing over data to machines to act on, so that audiences don’t even need to view that data.  In its extreme articulation, the semantic web/linked data vision considers content as irrelevant, and even audiences as irrelevant.

These ideas, while still alive and championed by their supporters, have largely failed to live up to expectations. There are many reasons for this failure, but a key one has been that proponents of linked data have failed to articulate its value to publishers and audiences. The goal of linked data always seems to be to feed more data to the machine. Linked data discussions get trapped in the mechanics of what’s best for machines (dereferenceable URIs, machine values that have no intrinsic meaning to humans), instead of what’s useful for people.

The emergence of schema.org (the structured data standard used in SEO) represents a step back from such machine-centric thinking, to accommodate at least some of the needs of human metadata creators by allowing text values. But schema.org still doesn’t offer much in the way of controlled vocabularies for values, which would be both machine-reliable and human-friendly.  It only offers a narrow list of specialized “enumerations”, some of which are not easy-to-read text values.

Schema.org has lots of potential, but its current capabilities get over-hyped by some in the SEO community.  Just as schema.org metadata should not be considered structured data, it is not really the semantic web either.  It’s unable to make inferences, which was a key promise of the semantic web.  Its limitations show why content remains important. Google’s answer to the problem of how to make structured data relevant to people was the rich snippet.  Rich snippets displayed in Google search results are essentially a vanity statement. Sometimes these snippets answer the question, but other times they simply tease the user with related information.  Publishers and audiences alike may enjoy seeing an extract of content in search results, and certainly rich snippets are a positive development in search. But displaying extracts of information does not represent an achievement of the power of data.  A list of answers supplied by rich snippets is far less definitive than a list of answers supplied by a conventional structured query database — an approach that has been around for over three decades.

The value of data comes from its capacity to aggregate, manipulate and compare information relating to many items.  Data can be impactful when arranged and processed in ways that change an audience’s perception and understanding of a topic. Genuine data provides values that can be counted and transformed, something that schema.org doesn’t support very robustly, as previously mentioned.  Google’s snippets, when parsing metadata values from articles, simply display fragments  from individual items of content.  A list of snippets doesn’t really federate information from multiple sources into a unified, consolidated answer.  If you ask Google what store sells the cheapest milk in your city, Google can’t directly answer that question, because that information is not available as data that can be compared.  Information retrieval (locating information) is not the same as data processing (consolidating information).
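
To illustrate the distinction, here is a minimal sketch (stores and prices invented) contrasting retrieval, which returns display fragments for a person to read, with processing, which consolidates typed values into a single answer:

```python
# Information retrieval returns display fragments for a person to read;
# data processing consolidates typed, comparable values into a single answer.
# All store names and prices below are invented.
snippets = [
    "FreshMart - Milk 2L ... weekly specials in store ...",
    "CornerShop: milk from $2.10 ... open late every day ...",
]

prices = [  # what consolidation requires: uniform, typed values
    {"store": "FreshMart", "item": "milk 2L", "price": 2.35, "currency": "USD"},
    {"store": "CornerShop", "item": "milk 2L", "price": 2.10, "currency": "USD"},
]

# Retrieval: the user still has to read and reconcile the fragments.
for snippet in snippets:
    print(snippet)

# Processing: with uniform data, the cheapest store is a one-line computation.
cheapest = min(prices, key=lambda p: p["price"])
print(f"Cheapest milk: {cheapest['store']} at {cheapest['price']} {cheapest['currency']}")
```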

“What is the point of all that data? A large data set is a product like any other. It must be maintained and updated, given attention. What are we to make of it?”  Paul Ford in “Usable Data”

But let’s assume that we do have solid data that machines can process without difficulty.  Can that data provide audiences with what they need?  Is content unnecessary when the data is machine quality?  Some evidence suggests that even the highest quality linked data isn’t sufficient to interest audiences.

The museum sector has been interested in linked data for many years. Unlike most web publishers, they haven’t been guided by schema.org and Google. They’ve been developing their own metadata standards. Yet this project has had its problems. The data lead of a well-known art museum complained recently of the “fetishization of Linked Open Data (LOD)”. Many museums approached data as something intrinsically valuable, without thinking through who would use the data, and why. Museums reasoned that they have lots of great content (their collections), that they needed to provide information about their collections online to everyone, and that linked data was the way to do that. But the author notes: ‘“I can’t wait to see what people do with our data” is not a clear ROI.’ When data is considered the goal, instead of a means to a goal, audiences get left out of the picture. This situation is common to many linked data projects, where getting data into a linked data structure becomes an all-consuming end, without anchoring the project in audience and business needs. For linked data to be useful, it needs to address specific use cases for people relying on the data.

Much magical thinking about linked data involves two assumptions: that the data will answer burning questions audiences have, and these answers will be sufficient to make explanatory content unnecessary.  When combined, these assumptions become one: everything you could possibly want to know is now available as a knowledge graph.

The promise that data can answer any question is animating development of knowledge graphs and “intelligent assistants” by nearly every big tech company: Google, Bing, LinkedIn, Apple, Facebook, etc.  This latest wave of data enthusiasm again raises questions whether content is becoming less relevant.

Knowledge graphs are a special form of linked data. Instead of the data living in many places, hosted by many different publishers, the data is consolidated into a single source curated by one firm, for example, Bing. A knowledge graph combines millions of facts about all kinds of things into a single data set. A knowledge graph creator generally relies on other publishers’ linked data, but it assumes responsibility for validating that data itself when incorporating the information in its knowledge graph. In principle, the information is more reliable, both factually and technically.

Knowledge graphs work best for persistent data (the birth year of a celebrity) but less well for high velocity data that can change frequently (the humidity right now).   Knowledge graphs can be incredibly powerful.  They can allow people to find connections between pieces of data that might not seem related, but are.  Sometimes these connections are simply fun trivia (two famous people born in the same hospital on the same day). Other times these connections are significant as actionable information.  Because knowledge graphs hold so much potential, it is often difficult to know how they can be used effectively.   Many knowledge graph use cases relate to open ended exploration, instead of specific tasks that solve well defined user problems.   Few people can offer a succinct, universally relevant reply to the question: “What problem does a knowledge graph solve?” Most of the success I’ve seen for knowledge graphs has been in specialized vertical applications aimed at researchers, such as biomedical research or financial fraud investigations.  To be useful to general audiences, knowledge graphs require editorial decisions that queue up on-topic questions, and return information relevant to audience needs and interests.  Knowledge graphs are less useful when they simply provide a dump of information that’s related to a topic.

Knowledge graphs combine aspects of Wikipedia (the crowdsourcing of data) with aspects of a proprietary gatekeeping platform such as Facebook (the centralized control of access to and prioritization of information).  No one party can be expected to develop all the data needed in a knowledge graph, yet one party needs to own the graph to make it work consistently — something that doesn’t always happen with linked data.   The host of the knowledge graph enjoys a privileged position: others must supply data, but have no guarantee of what they receive in return.

Under this arrangement, suppliers of data to a knowledge graph can’t calculate their ROI. Publishers are back in the situation where they must take a leap of faith that they’ll benefit from their effort. Publishers are asked to supply data to a service on the basis of a vague promise that the service will provide their customers with helpful answers. Exactly how the service will use the data is often not transparent. Knowledge graphs don’t reveal what data gets used, and when. Publishers also know that their rivals are supplying data to the same graph. The faith-based approach to developing data, in hopes that it will be used, has a poor track record.

The context of data retrieved from a knowledge graph may not be clear.  Google, Siri, Cortana, or Alexa may provide an answer.  But on what basis do they make that judgment?  The need for context to understand the meaning of data leads us back to content.   What a fact means may not be self-evident. Even facts that seem straightforward can depend on qualified definitions.

“A dataset precise enough for one purpose may not be sufficiently precise for another. Data on the Web may be wrong, or wrong in some context—with or without intent.” Bernstein, Hendler & Noy

The interaction between content and data is becoming even more consequential as the tech industry promotes services incorporating artificial intelligence. In his book Free Speech, Timothy Garton Ash shared his experience using Wolfram Alpha, a semantic AI platform that competes with IBM Watson, and that boldly claims to make the “world’s knowledge computable.” When Ash asked Wolfram Alpha “How free should speech be?”, it replied: “WolframAlpha doesn’t understand your query.” This kind of result is entirely expected, but it is worth exploring why something billed as being smart fails to understand. Conversational interfaces, after all, are promising to answer our questions. Data needs to exist for questions to get answers. For data to operate independently of content, an answer must be expressible as data. But many answers can’t be reduced to one or two values. Sometimes they involve many values. Sometimes answers can’t be expressed as a data value at all. This means that content will always be necessary for some answers.

Data as a Bridge to Content

Data and content have different temperaments.  The role of content is often to lead the audience to reveal what’s interesting.  The role of data is frequently to follow the audience as they indicate their interests. Content and data play complementary roles.  Each can be incomplete without the other.

Content, whether articles, video or audio, is typically linear.  Content is meant to be consumed in a prescribed order.   Stories have beginnings and ends, and procedures normally have fixed sequences of steps.  Hyperlinking content provides a partial solution to making a content experience less linear, when that is desired.  Linear experiences can be helpful when audiences need orientation, but they are constraining when such orientation isn’t necessary.

Data, to be seen, must first be selected. Publishers must select what data to highlight, or they must delegate that task to the audience. Data is non-linear: it can be approached in any order.  It can be highly interactive, providing audiences with the ability to navigate and explore the information in any order, and change the focus of the information.  With that freedom comes the possibility that audiences get lost, unable to identify information of value.  What data means is highly dependent on the audience’s previous understanding.  Data can be explained with other data, but even these explanations require prior  knowledge.

From an audience perspective, data plays various roles. Sometimes data is an answer, and the end of a task. Sometimes data is the start of a larger activity. Data is sometimes a signal that a topic should be looked at more closely. Few people decide to see a movie based on an average rating alone. A high rating might prompt someone to read about the film. Or the person may already be interested in reading about the film, and consults the average rating simply to confirm their own expectation of whether they’d like it. Data can be an entryway into a topic, and a point of comparison for audiences.

Writers can undervalue data because they want to start with the story they wish to tell, rather than the question or fact that prompts initial interest from the audience.   Audiences often begin exploration by seeking out a fact. But what that fact may be will be different according to each individual.  Content needs facts to be discovered.

Data evangelists can undervalue content because they focus on the simple use cases, and ignore the messier ones. Data can answer questions only in some situations. In an ideal world, questions and answers get paired together as data: just match the right data with the right question. But audiences may find it difficult to articulate the right question, or they may not know what question to ask. Audiences may find they need to ask many specific questions to develop a broad understanding. They may find the process of asking questions exhausting. Search engines and intelligent agents aren’t going to Socratically enlighten us about new or unfamiliar topics. Content is needed.

Ultimately, whether data or content is most important depends on how much communication is needed to support the audience.  Data supplies answers, but doesn’t communicate ideas.  Content communicates ideas, but can fail to answer if it lacks specific details (data) that audiences expect.

No bold line divides data from content.  Even basic information, such as expressing how to do something, can be approached either episodically as content, or atomically as data.  Publishers can present the minimal facts necessary to perform a task (the must do’s), or they can provide a story about possibilities of tasks to do (the may do’s).  How should they make that decision?

In my experience, publishers rarely create two radically alternative versions of online information, a data-centric and content-centric version, and test these against each other to see which better meets audience needs.  Such an approach could help publishers understand what the balance between content and data needs to be.  It could help them understand how much communication is required, so the information they provide is never in the way of the audience’s goals.

— Michael Andrews

Categories
Content Sharing

Thinking Beyond Semantic Search

Publishers are quickly adopting semantic markup, yet often get less value from it than they could. They don’t focus on how audiences can directly access and use their semantically-described content. Instead, publishers rely on search engines to boost their engagement with audiences. But there are limits to what content, and how much content, search engines will present to audiences.  Publishers should leverage their investment in semantic markup.  Semantically-described content can increase the precision and flexibility of content delivery.  To realize the full benefits of semantic markup, publishers need APIs and apps that can deliver more content, directly to their audiences, to help individuals explore content that’s intriguing and relevant.

The Value of Schema.org Markup

Semantic search is a buzzy topic now. With the encouragement of Google, SEO consultants promote marking up content with Schema.org so that Google can learn what the content is. A number of SEO consultants suggest that brands can use their markup to land a coveted spot in Google’s knowledge graph, and show up in Google’s answer box. There are good reasons to adopt Schema.org markup.  It may or may not boost traffic to your web pages.  It may or may not boost your brand’s visibility in search.  But it will help audiences get the information they need more quickly.  And every brand needs to be viewed as helpful, and not as creating barriers to access to information customers need.

But much of the story about semantic search is incomplete and potentially misleading. Only a few lucky organizations will manage to get their content in Google’s answer box. Google has multiple reasons to crawl content that is marked up semantically. Besides offering search results, Google is building its own knowledge database it will use for its own applications, now and in the future. By adding semantic annotation to their content that Google robots then crawl, publishers provide Google a crowd-sourced body of structured knowledge that Google can use for purposes that may be unrelated to search results. Semantic search’s role as a fact-collection mechanism is analogous to the natural-language machine learning Google developed through its massive book-scanning program several years ago.

Publishers rely on Google for search visibility, and effectively grant Google permission to crawl their content unless they indicate no-robots. Publishers provide Google with raw material in a format that’s useful to Google, but they can fail to ask how that format is useful to them as publishers. As with most SEO, publishers are being told to focus on what Google wants and needs. Unless one pays close attention to developments with Schema.org, one gets the impression that the only reason to create this metadata is to please Google. Google is so dominant that it seems as if it is entirely Google’s show. Phil Archer, data activity lead at the W3C, has said: “Google is the killer app of the semantic web.” Marking up content in Schema.org clearly benefits Google, but it often doesn’t help publishers nearly as much as it could.

Schema.org provides schemas “to markup HTML pages in ways recognized by major search providers, and that can also be used for structured data interoperability (e.g. in JSON).” According to its FAQs, its purpose is “to improve the web by creating a structured data markup schema supported by major search engines.”  Schema.org is first and foremost about serving the needs of search engines, though it does provide the possibility for data interoperability as well.  I want to focus on the issue of data interoperability, especially as it relates to audiences, because it is a widely neglected dimension.

Accessing Linked Data

Semantic search markup (Schema.org), linked data repositories such as GeoNames, and open content such as Wikipedia-sourced datasets of facts (DBpedia) all use a common, non-proprietary data model (RDF).  It is natural to view search engine markup as another step in the growth in the openness of the web, since more content is now described more explicitly.  Openness is a wonderful attribute: if data is not open, that implies it is being wasted, or worse, hoarded.  The goal is to publish your content as machine-intelligible data that is publicly accessible.  Because it’s on the web in a standardized format, anyone can access it, so it seems open.  But the formal guidelines that define the technological openness of open data are based more on standards-compliance by publishers than approachability by content consumers.  They are written from an engineering perspective. There is no notion of an audience in the concept of linked data. The concept presumes that the people who need the data have the technical means to access and use it.  But the reality is that much content that is considered linked data is effectively closed to the majority of people who need it, the audience for whom it was created. To access the data, they must rely on either the publisher, or a third party like Google, to give them a slice of what they seek.  So far, it’s almost entirely Google or Bing who have been making the data audience-accessible.  And they do so selectively.

Let’s look at a description of the Empire State Building in New York.  This linked data might be interesting to combine with other linked data concerning other tall buildings.  Perhaps school children will want to explore different aspects of tall buildings.  But clearly, school children won’t be able to do much with the markup themselves.

Schema.org description of Empire State Building in JSON-LD, via JSON-LD.org
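
The figure is not reproduced here, but a description along the following lines, shown as an illustrative sketch built in Python (the property choices are assumptions, not the figure’s actual markup), conveys what such schema.org JSON-LD looks like:

```python
# An illustrative schema.org description of the Empire State Building in
# JSON-LD, built as a Python dict. The property choices are assumptions for
# illustration, not a reproduction of the figure's actual markup.
import json

empire_state_building = {
    "@context": "https://schema.org",
    "@type": "LandmarksOrHistoricalBuildings",
    "name": "Empire State Building",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "350 5th Avenue",
        "addressLocality": "New York",
        "addressRegion": "NY",
        "postalCode": "10118",
    },
    "geo": {
        "@type": "GeoCoordinates",
        "latitude": 40.7484,
        "longitude": -73.9857,
    },
}

print(json.dumps(empire_state_building, indent=2))
```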

If one searches Google for information on tall buildings, they will provide an answer that draws on semantic markup.  But while this is a nice feature, it falls short of providing the full range of information that might be of interest, and it does not allow users to explore the information the way they might wish.  One can click on items in the carousel for more details, but the interaction is based on drilling-down to more specific information, or requiring a new search query, rather than providing a contextually dynamic aggregation of information.  For example, if the student wants to find out which architect is responsible for the most tall buildings in the world, Google doesn’t offer a good way to get to that information iteratively.  If the student asks Google “which country has the most tall buildings?” she is simply given a list of search results, which includes a Wikipedia page where the information is readily available.

Relying on Google to interpret the underlying semantic markup means that the user is limited to the specific presentation that Google chooses to offer at a given time.  This dependency on Google’s choices seems far from the ideals promised by the vision of linked open data.

Screenshot of Google search for tallest buildings

Google and Bing have invested considerable effort in making semantic search a reality: communication campaigns to encourage implementation of semantic markup, and technical resources to consume this markup to offer their customers a better search experience. They crawl and index every word on every page, and perform an impressive range of transformations of that information to understand and use it. But the process that the search engines use to extract meaning from content is not something that ordinary content consumers can do, and in many ways is more complicated than it needs to be. One gets a sense of how developer-driven semantic search markup is by looking at the fluctuating formats used by Schema.org. There are three different markup languages (microdata, RDFa, and JSON-LD) with significantly different ways of characterizing the data. Google’s robots are sophisticated enough to interpret any of these types of markup. But people not working for a search engine firm need to rely on something like Apache Any23, a Java library, to extract semantic content marked up in different formats.

Screenshot of Apache Any23
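
For people outside the search engine firms, extraction requires tooling of their own. As a rough sketch, assuming a recent version of the rdflib Python library (which includes a JSON-LD parser), pulling facts out of a page’s annotations might look like this:

```python
# A rough sketch of extracting schema.org facts from a page's JSON-LD block,
# assuming a recent version of the rdflib library (6.x), which includes a
# "json-ld" parser. Fetching and scraping the HTML is omitted for brevity.
from rdflib import Graph

jsonld_markup = """
{
  "@context": "https://schema.org",
  "@type": "LandmarksOrHistoricalBuildings",
  "name": "Empire State Building",
  "address": {"@type": "PostalAddress", "addressLocality": "New York"}
}
"""

g = Graph()
g.parse(data=jsonld_markup, format="json-ld")

# Every statement becomes a subject-predicate-object triple that can be
# queried, or merged with triples extracted from other pages and formats.
for subject, predicate, obj in g:
    print(subject, predicate, obj)
```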

Linked Data is Content that needs a User Interface

How does an ordinary person link to content described with Schema.org markup? Tim Berners-Lee famously described linked data as “browseable data.” How can we browse all this great stuff that’s out there, that’s been finally annotated so that we get the exact bits we want?  Audiences should have many avenues to retrieving content so that they can use it in the context where they need it. They need a user interface to the linked data.  We need to build this missing user interface.  For this to happen, there need to be helpful APIs, and easy-to-use consumer applications.

APIs

The goal of APIs is to find other people to promote the use of your content.  Ideally, they will use your content in ways you might not even have considered, and therefore be adding value to the content by expanding its range of potential use.

APIs play a growing role in the distribution of content.  But they often aren’t truly open in the sense they offer a wide range of options to data consumers.  APIs thus far seem to play a limited role in enabling the use of content annotated with  schema.org markup.

Getting data from an API can be a chore, even for quantitatively sophisticated people who are used to thinking about variables.  AJ Hirst, an open data advocate who teaches at the Open University, says: “For me, a stereotypical data user might be someone who typically wants to be able to quickly and easily get just the data they want from the API into a data representation that is native to the environment they are working in, and that they are familiar with working with.”

API frictions are numerous: people need to figure out what data is available, what it means, and how they can use it.  Hirst advocates more user-friendly discovery resources. “If there isn’t a discovery tool they can use from the tool they’re using or environment they’re working in, then finding data from service X turns into another chore that takes them out of their analysis context.”  His view: “APIs for users – not programmers. That’s what I want from an API.”

The other challenge is that query-possibilities for semantic content go beyond the basic functions commonly used in APIs.

Jeremiah Lee, an API designer at Fitbit, has thought about how to encourage API providers and users to think more broadly about what content is available, and how it might be used.  He notes: “REST is a great starting point for basic CRUD operations, but it doesn’t adequately explain how to work with collections, relational data, operations that don’t map to basic HTTP verbs, or data extracted from basic resources (such as stats). Hypermedia proponents argue that linked resources best enable discoverability, just as one might browse several linked articles on Wikipedia to learn about a subject. While doing so may help explain resource relationships after enough clicking, it’s not the best way to communicate concepts.”

For Linked Data, a new API specification called Hydra is under development that aims to address some of the technical limitations of standard APIs that Lee mentions. But the human challenges remain, and the richer the functionality offered by an API, the more important it is that the API be self-describing.

Fitbit’s API, while not a semantic web application, does illustrate some novel properties that could be used for semantic web APIs, including a more visually rich presentation with more detailed descriptions and suggestions available via tooltips.  These aid the API user, who may have various goals and levels of knowledge relating to the content.

Screenshot of Fitbit API

Consumer apps

The tools available to ordinary content users to add semantic descriptions have become more plentiful and easier to use.  Ordinary web writers can use Google’s data highlighter to indicate what content elements are about.  Several popular CMS platforms have plug-ins that allow content creators to fill-in forms to describe the content on the page.  These kinds of tools hide the markup from the user, and have been helpful in spurring adoption of semantic markup.

While the creation of semantic content has become popularized, there has not been equivalent progress in developing user-friendly tools that allow audiences to retrieve and explore semantic content. Paige Morgan, an historian who is developing a semantic data set of economic information, notes: “Unfortunately, structuring your data and getting it into a triplestore is only part of the challenge. To query it (which is really the point of working with RDF, and which you need to do in order to make sure that your data structure works), you need to know SPARQL — but SPARQL will return a page of URIs (uniform resource identifiers — which are often in the form of HTML addresses). To get data out of your triplestore in a more user-friendly and readable format, you need to write a script in something like Python or Ruby.  And that still isn’t any sort of graphical user interface for users who aren’t especially tech-savvy.”
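
Her point about the missing interface is easy to demonstrate. Even a minimal sketch of glue code (here using the SPARQLWrapper Python library against the public DBpedia endpoint; the query itself is illustrative) shows the kind of programming knowledge currently required just to get readable values out of a triplestore:

```python
# A minimal sketch of querying the public DBpedia SPARQL endpoint from Python
# with the SPARQLWrapper library. The query is illustrative: it asks for a few
# buildings and their heights, then prints readable labels instead of raw URIs.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label ?height WHERE {
      ?building a dbo:Building ;
                rdfs:label ?label ;
                dbo:height ?height .
      FILTER (lang(?label) = "en")
    }
    LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["label"]["value"], row["height"]["value"])
```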

We lack consumer-oriented applications that allow people to access and recombine linked data.  There is no user interface for individuals to link themselves to the linked data.  The missing UI reflects a legacy of seeing linked data as being entirely about making content machine-readable.  According to legacy thinking, if people needed to directly interact with the data, they could download it to a spreadsheet.  The term “data” appeals to developers who are comfortable thinking about content structured as databases, but it doesn’t suggest application to things that are mentioned in narrative content.  Most content described by Schema.org is textual content, not numbers, which is what most non-IT people consider as data.  And text exists to be read by people.  But the jargon we are stuck with to discuss semantic content means we emphasize the machine/data side of the equation, rather than the audience/content side of it.

Linked data in reality are linked facts, facts that people can find useful in a variety of situations.  Google Now is ready to use your linked data and tell your customers when they should leave the house.  Google has identified the contextual value to consumers of linked data.  Perhaps your brand should also use linked data in conversations with your customers.  To do this, you need to create consumer facing apps that leverage linked data to empower your customers.

Wolfram Alpha is a well-known consumer app for exploring data on general topics that has been collected from various sources. They characterize their mission, quite appealingly, as “democratizing data.” The app is user friendly, offering query suggestions to help users understand what kinds of information can be retrieved, and refine their queries. Their solution is not open, however. According to Wolfram’s Luc Barthelet, “Wolfram|Alpha is not searching the Semantic Web per se. It takes search queries and maps them to an exact semantic understanding of the query, which is then processed against its curated knowledge base.” While more versatile than Google search in the range and detail of information retrieved, it is still a gatekeeper, where individuals are dependent on the information collection decisions of a single company. Wolfram lacks an open-standards, linked-data foundation, though it does suggest how a consumer-focused application might make use of semantic data.

The task of developing an app is more manageable when the app is focused on a specific domain.  The New York Times and other news organizations have been working with linked data for several years to enhance the flexibility of the information they offer.  In 2010 the Times created an “alumni in the news” app that let people track mentions of people according to what university they attended, where the educational information was sourced from DBpedia.

New York Times Linked Data app for alumni in the news. It relied in part on linked data from Freebase, a Google product that Google is retiring and that will be superseded by Wikidata.

A recent example of a consumer app that is using linked data is a sports-oriented social network called YourSports.  The core metadata of the app is built in JSON-LD, and the app creator is even proposing extensions to Schema.org to describe sports relationships.  This kind of app hides the details of the metadata from the users, and enables them to explore data dimensions as suits their interests.  I don’t have direct experience of this app, but it appears to aggregate and integrate sports-related factual content from different sources.  In doing so, it enhances value for users and content producers.

Screenshot of Yoursports

Opening up content, realizing content value

If your organization is investing in semantic search markup, you should be asking: How else can we leverage this?  Are you using the markup to expose your content in your APIs so other publishers can utilize the content?  Are you considering how to empower potential readers of your content to explore what you have available?  Consumer brands have an opportunity to offer linked data to potential customers through an app that could result in lead generation.  For example, a travel brand could use linked data relating to destinations to encourage trip planning, and eventual booking of transportation, accommodation, and events.  Or an event producer might seed some of its own content to global partners by creating an API experience that leverages the semantic descriptions.

The pace of adoption for aspects of the semantic web has been remarkable. But it is easy to overlook what is missing. A position paper for Schema.org says “Schema.org is designed for extremely mainstream, mass-market adoption.” But to consider the mass-market only as publishers acting in their role as customers of search engines is too limiting. The real mainstream, mass-market is the audience that is consuming the content. These people may not even have used a search engine to reach your content.

Audiences need ways to explore semantically-defined factual content as they please. It is nice that one can find bits of content through Google, but it would be better if one didn’t have to rely solely on Google to explore such content. Yes, Google search is often effective, but search results aren’t really browseable. Search isn’t designed for browsing: it’s designed to locate specific, known items of information. Semantic search provides a solution to the issue of too much information: it narrows the pool of results. Google in particular is geared to offering instant answers, rather than sustaining an engaging content experience.

Linked data is larger than semantic search.  Linked data is designed to discover connections, to see themes worth exploring. Linked data allows brands to juxtapose different kinds of information together that might share a common location or timing, for example. Individuals first need to understand what questions they might be interested in before they are ready for answers to those questions. They start with goals that are hard to define in a search query.  Linked data provides a mechanism to help people explore content that relates to these goals.

While Google knows a lot about many things relating to a person, and people in general, it doesn’t specialize in any one area.  The best brands understand how their customers think about their products and services, and have unique insights into the motivations of people with respect to a specific facet of their lives.  Brands that enable people to interact with linked data, and allow them to make connections and explore possibilities, can provide prospective customers something they can’t get from Google.

— Michael Andrews

Categories
Intelligent Content

Data Types and Data Action

We often think about content from a narrative perspective, and tend to overlook the important roles that data play for content consumers. Specific names or numeric figures often carry the greatest meaning for readers. Such specific factual information is data. It should be described in a way that lets people use the data effectively.

Not all data is equally useful; what matters is our ability to act on data. Some data allows you to do many different things with it, while other data is more limited. The stuff one can do with types of data is sometimes described as the computational affordances of data, or as data affordances.

The concept of affordances comes from the field of ecological psychology, and was popularized by the user experience guru Donald Norman. An affordance is a signal encoded in the appearance of an object that suggests how it can be used and what actions are possible. A door handle may suggest that it should be pushed, pulled or turned, for example. Similarly, with content we need to be able to recognize the characteristics of an item of data, to understand how it can be used.

Data types and affordances

The postal code is an important data type in many countries. Why is it so important? What can you do with a postal code? How people use postal codes provides a good illustration of data affordances in action.

Data affordances can be considered in terms of their purpose-depth, and purpose-scope, according to Luciano Floridi of the Oxford Internet Institute. Purpose-depth relates to how well the data serves its intended purpose. Purpose-scope relates to how readily the data can be repurposed for other uses. Both characteristics influence how we perceive the value of the data.

A postal code is a simplified representation of a location composed of households. Floridi notes that postal codes were developed to optimize the delivery of mail, but subsequently were adopted by other actors for other purposes, such as to allocate public spending, or calculate insurance premiums.

He states: “Ideally, high quality information… is optimally fit for the specific purpose/s for which it is elaborated (purpose–depth) and is also easily re-usable for new purpose/s (purpose–scope). However, as in the case of a tool, sometimes the better [that] some information fits its original purpose, the less likely it seems to be repurposable, and vice versa.” In short, we don’t want data to be too vague or imprecise, and we also want the data to have many ways it can be used.

Imagine if all data were simple text. That would limit what one could do with that data. Defining data types is one way that data can work harder for specific purposes, and become more desirable in various contexts.

A data type determines how an item is formatted and what values are allowed. The concept will be familiar to anyone who works with Excel spreadsheets, and notices how Excel needs to know what kind of value a cell contains.

In computer programming, data types tell a program how to assess and act on variables. Many data types relate to issues of little concern to content strategy, such as various numeric types that impact the speed and precision of calculations. However, there is a rich range of data types that provide useful information and functionality to audiences. People make decisions based on data, and how that data is characterized influences how easily they can make decisions and complete tasks.

Here are some generic data types that can be useful for audiences, each of which has different affordances:

  • Boolean (true or false)
  • Code (showing computer code to a reader, such as within the HTML code tags)
  • Currency (monetary cost or value denominated in a currency)
  • Date
  • Email address
  • Geographic coordinate
  • Number
  • Quantity (a number plus a unit type, such as 25 kilometers)
  • Record (an identifier composed of compound properties, such as 13th president of a country)
  • Telephone number
  • Temperature (similar to quantity)
  • Text – controlled vocabulary (such as the limited range of values available in a drop-down menu)
  • Text – variable length free text
  • Time duration (number of minutes, not necessarily tied to a specific date)
  • URI or URN (authoritative resource identifier belonging to a specific namespace, such as an ISBN number)
  • URL (webpage)

Not all content management systems will provide structure for these data types out of the box, but most should be supportable with some customization. I have adapted the above list from the listing of data types supported by Semantic MediaWiki, a widely used open source wiki, and the data types common in SQL databases.
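
As a minimal sketch (the content type and field names are invented for illustration), here is what giving each field an explicit data type can look like in code, rather than treating every value as undifferentiated text:

```python
# A sketch of a typed content item. The RecipeFacts type and its fields are
# invented for illustration; the point is that each field carries a data type
# with its own affordances, instead of being undifferentiated text.
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from enum import Enum


class Difficulty(Enum):          # text with a controlled vocabulary
    EASY = "easy"
    MODERATE = "moderate"
    HARD = "hard"


@dataclass
class RecipeFacts:
    title: str                   # variable-length free text
    published: date              # date
    cost_per_serving: Decimal    # currency amount
    prep_minutes: int            # time duration
    flour_grams: float           # quantity: a number plus an implied unit
    difficulty: Difficulty       # controlled vocabulary
    gluten_free: bool            # Boolean
    source_url: str              # URL


pancakes = RecipeFacts("Pancakes", date(2015, 3, 1), Decimal("0.40"),
                       20, 250.0, Difficulty.EASY, False,
                       "https://example.com/pancakes")
print(pancakes)
```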

By having distinct data types with unique affordances, publishers and audiences can do more with content. The ways people can act on data are many:

  • Filter by relevant criteria: Content might use geolocation data to present a telephone number in the reader’s region
  • Start an action: Readers can click-to-call telephone numbers that conform to an international standard format
  • Sort and rank: Various data types can be used to sort items or rank them
  • Average: When using controlled vocabularies in text, the number of items with a given value can be counted or averaged
  • Sum together: Content containing quantities can be summed: for example, recipe apps allow users to add together common ingredients from different dishes to determine the total amount of an ingredient required for a meal
  • Convert: A temperature can be converted into different units depending on the reader’s preference
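
A brief sketch (with invented values, independent of the example above) shows how a few of these actions become one-line operations once values are typed:

```python
# Illustrative actions on typed data values (all figures invented).
from datetime import date
from decimal import Decimal

articles = [
    {"title": "Soup basics", "published": date(2015, 3, 1), "cost": Decimal("1.99")},
    {"title": "Bread at home", "published": date(2015, 2, 10), "cost": Decimal("0.99")},
]

# Sort and rank: newest first, because dates are real dates rather than strings.
newest_first = sorted(articles, key=lambda a: a["published"], reverse=True)

# Sum together: quantities sharing a unit can be added, as a recipe app does.
flour_needed_grams = sum([250.0, 500.0])

# Convert: a temperature can be shown in the reader's preferred unit.
def celsius_to_fahrenheit(celsius: float) -> float:
    return celsius * 9 / 5 + 32

print([a["title"] for a in newest_first], flour_needed_grams, celsius_to_fahrenheit(180))
```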

The choice of data type should be based on what your organization wants to do with the content, and what your audience might want to do with it. It is possible to reduce most character-based data to either a string or a number, but such simplification will reduce the range of actions possible.

Data versus Metadata

The boundary between data and metadata is often blurry. Data associated with both metadata and the content body-field have important affordances. Metadata and data together describe things mentioned within or about the content. We can act on data in the content itself, as well as act on data within metadata framing the content.

Historically, structural metadata outside the content played a prominent role in indicating the organization of the content, which implied what the content was about. Increasingly, meaning is being embedded with semantic markup within the content itself, and structural metadata surrounding the content may be limited. A news article may no longer indicate a location in its dateline, but may have the story location marked up within the article, where it can be referenced by content elsewhere.
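
As a rough sketch (the article and its details are invented; NewsArticle and contentLocation are schema.org terms), attaching the story location to the article as structured markup might look like this:

```python
# A rough sketch of attaching a story's location to an article as structured
# markup rather than relying on a conventional dateline. The article is
# invented; NewsArticle and contentLocation are schema.org terms.
import json

news_article = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "headline": "Harbor bridge reopens after repairs",
    "datePublished": "2015-06-01",
    "contentLocation": {
        "@type": "Place",
        "name": "Sydney",
        "geo": {"@type": "GeoCoordinates", "latitude": -33.8688, "longitude": 151.2093},
    },
}

# A map widget or a related-stories module can reference the marked-up
# location without parsing the article's prose.
print(json.dumps(news_article, indent=2))
```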

Administrative metadata, often generated by a computer and traditionally intended for internal use, may have value to audiences. Consider the humble date stamp, indicating when an article was published. By seeing a list of most recent articles, audiences can tell what’s new and what that content is about, without necessarily viewing the content itself.

Van Hooland and Verborgh ask in their recent book on linked data: “[W]here to draw the line between data and metadata. The short answer is you cannot. It is the context of the use which decides whether to consider data as metadata or not. You should also not forget one of the basic characteristics of metadata: they are ever extensible …you can always add another layer of metadata to describe your metadata.” They point out that annotations, such as reviews of products, become content that can itself be summarized and described by other data. The number of stars a reviewer gives a product is aggregated with the feedback of other reviewers to produce an average rating, which is metadata about both the product and the individual reviews on which it is based.

Arguably, the rise of social interaction with nearly all facets of content merits an expansion of metadata concepts. By convention, information standards divide metadata into three categories: structural metadata, administrative metadata and descriptive metadata. But one academic body suggests a fourth type of metadata they call “use metadata,” defined as “metadata collected from or about the users themselves (e.g., user annotations, number of people accessing a particular resource).” Such metadata would blend elements of administrative and descriptive metadata relating to readers, rather than authors.

Open Data and Open Metadata

Open data is another data dimension of interest to content strategy. Often people assume open data refers to numeric data, but it is more helpful to think of open data as the re-use of facts.

Open data offers a rich range of affordances, including the ability to discover and use other people’s data, and the ability to make your data discoverable and available to others. Because of this emphasis on the exchange of data, how this data is described and specified is important. In particular, transparency and use rights issues with open data are a key concern, as administrative metadata in open data is a weakness.

Unfortunately, discussion of open data often focuses on the technical accessibility of data to systems, rather than the utility of data to end-users. There is an emphasis on data formats, but not on vocabularies to describe the data. Open data promotes the use of open formats that are non-proprietary. While important, this focus misses the criticality of having shared understandings of what the data represents.

To the content strategist, the absence of guidelines for metadata standards is a shortcoming in the open data agenda. This problem was recognized in a recent editorial in the Semantic Web Journal entitled “Five Stars of Linked Data Vocabulary Use.” Its authors note: “When working with data providers and software engineers, we often observe that they prefer to have control over their local vocabulary instead of importing a wide variety of (often under-specified, not regularly maintained) external vocabularies.” In other words, because there is not a commonly agreed and used metadata standard, people rely on proprietary ones instead, even when they publish their data openly, which has the effect of limiting the value of that data. They propose a series of criteria to encourage the publication of metadata about vocabulary used to describe data, and the provision of linkages between different vocabularies used.

Classifying Openness

Whether data is truly open depends on how freely available the data is, and whether the metadata vocabulary (markup) used to describe it is transparent. In contrast to the Open Data Five Star frameworks, I view how proprietary the data is as a decisive consideration. Data can be either open or proprietary, and the metadata used to describe the data can be based either on an open or proprietary standard. Not all data that is described as “Open” is in fact non-proprietary.

What is proprietary? For data and metadata, the criteria for what is non-proprietary can be ambiguous, unlike with creative content, where the Creative Commons framework governs rights for use and modifications. Modification of data and its metadata is of less concern, since such modifications can destroy the re-use value of the content. Practicality of data use and metadata visibility are the central concerns. To untangle various issues, I will present a tentative framework, recognizing that some distinctions are difficult to make. How proprietary data and metadata are often reflects how much control the body responsible for this information exerts. Generally, data and metadata standards that are collectively managed are more open than those managed by a single firm.

Data

We can grade data into three degrees, based on how much control is applied to its use:

  1. Freely available open data
  2. Published but copyrighted data
  3. Selectively disclosed data

Three criteria are relevant:

  1. Is all the data published?
  2. Does a user need to request specific data?
  3. Are there limits on how the data can be used?

If factual data is embedded within other content (for example, using RDFa markup within articles), it is possible that only the data is freely available to re-use, while the contextual content is not freely available to re-use. Factual data cannot be copyrighted in the United States, but may under certain conditions be subject to protection in the EU when a significant investment was made collecting these facts.

Rights management and rights clearance for open data are areas of ongoing (if inconclusive) deliberation among commercial and fee-funded organizations. The BBC is an organization that contributes open data for wider community use, but that generally retains the copyright on its content. More and more organizations are making their data discoverable by adopting open metadata standards, but the extent to which they sanction the re-use of that data for purposes different from its original intention is not always clear. In many cases, everyday practices concerning data re-use are evolving ahead of official policies defining what is permitted and not permitted.

Metadata

Metadata is either open or proprietary. Open metadata is when the structure and vocabulary that describes the data is fully published, and is available for anyone to use for their own purposes. The metadata is intended to be a standard that can be used by anyone. Ideally, they have the ability to link their own data using this metadata vocabulary to data sets elsewhere. This ability to link one’s own data distinguishes it from proprietary metadata standards.

Proprietary metadata is metadata whose schema is not published or is only partially published, or that restricts a person’s ability to describe their own data using the vocabulary.

Examples

Freely Available Open Data

  • With Open Metadata. Open data published using a publicly available, non-proprietary markup. There are many standards organizations that are creating open metadata vocabularies. Examples include public content marked up in Schema.org, and NewsML. These are publicly available standards without restrictions on use. Some standards bodies have closed participation: Google, Yahoo, and Bing decide what vocabulary to include in Schema.org, for example.
  • With Proprietary Metadata. It may seem odd to publish your data openly but use proprietary markup. However, organizations may choose to use a proprietary markup if they feel a good public one is not available. Non-profit organizations might use OpenCalais, a markup service available for free, which is maintained by Reuters. Much of this markup is based on open standards, but it also uses identifiers that are specific to Reuters.

Published But Copyrighted Data

  • With Open Metadata. This situation is common with organizations that make their content available through a public API. They publish the vocabularies used to describe the data and may use common standards, but they maintain the rights to the content. Anyone wishing to use the content must agree to the terms of use for the content. An example would be NPR’s API.
  • With Proprietary Metadata. Many organizations publish content using proprietary markup to describe their data. This situation encourages web-scraping by others to unlock the data. Sometimes publishers may make their content available through an API, but they retain control over the metadata itself. Amazon’s ASIN product metadata would be an example: other parties must rely on Amazon to supply this number.

Selectively Disclosed Proprietary Data

  • With Open Metadata. Just because a firm uses a data vocabulary that’s been published and is available for others to use, it doesn’t mean that such firms are willing to share their own data. Many firms use metadata standards because it is easier and cheaper to do so, compared with developing their own. In the case of Facebook, they have published their Open Graph schema to encourage others to use it so that content can be read by Facebook applications. But Facebook retains control over the actual data generated by the markup.
  • With Proprietary Metadata. Applies to any situation where firms have limited or no incentive to share data. Customer data is often in this category.

Taking Action on Data

Try to do more with the data in your content. Think about how to enable audiences to take actions on the data, or how to have your systems take actions to spare your audiences unnecessary effort. Data needs to be designed, just like other elements of content. Making this investment will allow your organization to reuse the data in more contexts.

— Michael Andrews