
Should Information be Data-Rich or Content-Rich?

One of the most challenging issues in online publishing is how to strike the right balance between content and data.  Publishers of online information, as a matter of habit, tend to favor either a content-centric or a data-centric approach.  Publishers may hold deep-seated beliefs about what form of information is most valuable.  Some believe that compelling stories will wow audiences. Others expect that new artificial intelligence agents, providing instantaneous answers, will delight them. This emphasis on information delivery can overshadow consideration of what audiences really need to know and do. How information is delivered can get in the way of what the audience needs. Instead of delight, audiences experience apathy and frustration. The information fails to deliver the right balance between facts and explanation.

The Cultural Divide

Information on the web can take different forms. Perhaps the most fundamental difference is whether online information provides a data-rich or content-rich experience. Each form of experience has its champions, who promote the virtues of data (or content).  Some go further, and dismiss the value of the approach they don’t favor, arguing that content (or data) actually gets in the way of what users want to know.

  • A (arguing for data-richness): Customers don’t want to read all that text!  They just want the facts.  
  • B (arguing for content-richness): Just showing facts and figures will lull customers to sleep!

Which is more important, offering content or data?  Do users want explanations and interpretations, or do they just want the cold hard facts?  Perhaps it depends on the situation, you think.  Think of a situation where people need information.  Do they want to read an explanation and get advice, or do they want a quick unambiguous answer that doesn’t involve reading (or listening to a talking head)?  The scenario you have in mind, and how you imagine people’s needs in that scenario, probably reveals something about your own preferences and values.  Do you like to compare data when making decisions, or do you like to consider commentary?  Do your own PowerPoint slides show words and images, or do they show numbers and graphs? Did you study a content-centric discipline such as the humanities in university, or did you study a data-centric one such as commerce or engineering? What are your own definitions of what’s helpful or boring?

Our attitudes toward content and data reflect how we value different forms of information.  Some people favor more generalized and interpreted information, and others prefer specific and concrete information.  Different people structure information in different ways, through stories for example, or by using clearly defined criteria to evaluate and categorize information.  These differences may exist within your target audience, just as they may show up within the web team trying to deliver the right information to that audience.  People vary in their preferences. Individuals may shift their personal  preferences depending on topic or situation.  What form of information audiences will find most helpful can elude simple explanations.

Content and data have an awkward relationship. Each seems to involve a distinct mode of understanding.  Each can seem to interrupt the message of the other. When relying on a single mode of information, publishers risk either over-communicating, or under-communicating.

Content and Data in Silhouette

To keep things simple (and avoid conceptual hairsplitting), let’s think about data as any values that are described with an attribute.  We can consider data as facts about something.  Data can be any kind of fact about a thing; it doesn’t need to be a number. Whether text or numeric, data are values that can be counted.
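As a minimal sketch of this definition, the Python below treats each fact as an attribute paired with a value and counts how often each value occurs. The film records, the attribute names, and the `count_values` helper are hypothetical, invented only for illustration.

```python
from collections import Counter

# Hypothetical records: each fact is an attribute paired with a value.
films = [
    {"title": "Film A", "genre": "drama", "runtime_min": 112},
    {"title": "Film B", "genre": "comedy", "runtime_min": 95},
    {"title": "Film C", "genre": "drama", "runtime_min": 130},
]


def count_values(records, attribute):
    """Count how often each value of the given attribute appears."""
    return Counter(record[attribute] for record in records)


genre_counts = count_values(films, "genre")
print(genre_counts["drama"])  # 2
```

The text value "drama" is counted in exactly the same way a number would be, which is the sense in which data need not be numeric.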

Content can involve many distinct types, but for simplicity, we’ll consider content as articles and videos — containers  where words and images combine to express ideas, stories, instructions, and arguments.

Both data and content can inform.  Content has the power to persuade, and sometimes data can possess that power as well.  So what is the essential difference between them?  Each has distinct limitations.

The Limits of Content

In certain situations content can get in the way of solving user problems.  Many times people are in a hurry, and want to get a fact as quickly as possible.  Presenting data directly to audiences doesn’t always mean people get their questions answered instantly, of course.  Some databases are lousy at answering questions for ordinary people who don’t use databases often.  But a growing range of applications now provide “instant answers” to user queries by relying on data and computational power.  Whereas content is a linear experience, requiring time to read, view or listen, data promises instant experience that can gratify immediately.  After all, who wants to waste their customers’ time?  Content strategy has long advocated solving audience problems as quickly as possible.  Can data obviate the need for linear content?

“When you think about something and don’t really know much about it, you will automatically get information.  Eventually you’ll have an implant where if you think about a fact, it will just tell you the answer.”  Google’s Larry Page, in Steven Levy’s  “In the Plex”.

The argument that users don’t need websites (and their content) is advanced by SEO expert Aaron Bradley in his article “Zero Blue Links: Search After Traffic”.   Aaron asks us to “imagine a world in which there was still an internet, but no websites. A world in which you could still look for and find information, but not by clicking from web page to web page in a browser.”

Aaron notes that within Google search results, increasingly it is “data that’s being provided, rather than a document summary.”  Audiences can see a list of product specs, rather than a few sentences that discuss those specs. He sees this as the future of how audiences will access information on different devices.  “Users of search engines will increasingly be the owners of smart phones and smart watches and smart automobiles and smart TVs, and will come to expect seamless, connected, data-rich internet experiences that have nothing whatsoever to do with making website visits.”

In Aaron’s view, we are seeing a movement from “documents to data” on the web. “The evolution of search results in terms of the gradual supplanting of document references by data than it is to infer that direction through the enumeration of individual features.”  No need to read a document: search results will answer the question.  It’s an appealing notion, and one that is becoming more commonplace.  Content isn’t always necessary if clear, unambiguous data is available that can answer the question.

Google, or any search engine, is just a channel — an important one for sure, but not the end-all and be-all.  Search engines locate information created by others, but unless they have rights to that information, they are limited in what they do with it. Yet the principles here can apply to other kinds of interactive apps, channels and platforms that let users get information instantly, without wading through articles or videos.  So is content now obsolete?

There is an important limitation to considering SEO search results as data.  Even though the SEO community refers to search metadata as “structured data”, the use of this term is highly misleading.  The values described by the metadata aren’t true data that can be counted.  They are values to display, or are links to other values.  The problem with structured data as currently practiced is that it doesn’t enforce how the values need to be described.  The structured data values are never validated, so computers can’t be sure if two prices appearing on two random websites are both quoting the same currency, even if both mention dollars.  SEO structured data rarely requires controlled vocabulary for text values, and most of its values don’t include or mandate the data typing that computers would need to aggregate and compare different values.  Publishers are free to use almost any kind of text value they like in many situations.   The reality of SEO structured data is less glamorous than its image: much of the information described by SEO structured data is display content for humans to read, rather than data for machines to transform.  The customers who scan Google’s search results are people, not machines.  People still need to evaluate the information, and decide its credibility and relevance.  The values aren’t precise and reliable enough for computers to make such judgements.
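To make the currency problem concrete, here is a small Python sketch contrasting display text with a typed value. The `Price` class and `cheaper` function are hypothetical names invented for this illustration; they are not part of schema.org or any search engine's tooling, and the prices are made up.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Price:
    """A typed price: the currency is explicit rather than implied by a symbol."""
    amount: Decimal
    currency: str  # assumed to be an ISO 4217 code such as "USD" or "CAD"


def cheaper(a: Price, b: Price) -> Price:
    """Return the lower price, refusing to compare across currencies."""
    if a.currency != b.currency:
        raise ValueError(f"cannot compare {a.currency} with {b.currency}")
    return a if a.amount <= b.amount else b


# Display text: both strings "mention dollars", but nothing says which dollars.
site_one_text = "$4.99"
site_two_text = "$5.49"

# Typed values: a comparison is either valid or explicitly rejected.
usd = Price(Decimal("4.99"), "USD")
cad = Price(Decimal("5.49"), "CAD")
```

Calling `cheaper(usd, cad)` raises a `ValueError`, whereas the two display strings would compare without complaint and could yield a misleading answer.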

When an individual wants to know what time a shop closes, it’s a no-brainer to provide exactly that information, and no more. The strongest case for presenting data directly is when the user already knows exactly what they want to know, and they will understand the meaning and significance of the data shown.  These are the “known unknowns” (or “knowns but forgotten”) use cases.  Plenty of such cases exist.  But while the lure of instant gratification is strong, people aren’t always in a rush to get answers, and in many cases they shouldn’t be in a rush, because the question is bigger than a single answer can address.

The Limits of Data

Data in various circumstances can get in the way of what interests audiences.  At a time when the corporate world increasingly extols the virtues of data, it’s important to recognize when data can be useless, because it doesn’t answer questions that audiences have.  Publishers should identify when data is oversold as always being what audiences want.  Unless data reflects audiences’ priorities, the data is junk as far as audiences are concerned.

Data can bring credibility to content, though it has the potential to confuse and mislead as well.  Audiences can be blinded by data when it is hard to comprehend, or is too voluminous. Audiences need to be interested in the data for it to provide them with value.  Much of the initial enthusiasm for data journalism, the idea of writing stories based on the detailed analysis of facts and statistics, has receded.  Some stories have been of high quality, but many weren’t intrinsically interesting to large numbers of viewers.  Audiences didn’t necessarily see themselves in the minutiae, or feel compelled to interact with raw material being offered to them.  Data journalism stories are different from commercially oriented information, which has well defined use cases specifying how people will interact with data.  Data journalism can presume people will be interested in topics simply because public data on these topics is available.  However, this data may be collected for a different purpose, often for technical specialists.  Presenting it doesn’t transform it into something interesting to audiences.

The experience of data journalism shows that not all data is intrinsically interesting or useful to audiences.  But some technologists believe that making endless volumes of data available is intrinsically worthwhile, because machines have the power to unlock value from the data that can’t be anticipated.

The notion that “data is God” has fueled the development of the semantic web approach, which has subsequently been  rebranded as “linked data”.  The semantic web has promised many things, including giving audiences direct access to information without the extraneous baggage of content.  It even promised to make audiences irrelevant in many cases, by handing over data to machines to act on, so that audiences don’t even need to view that data.  In its extreme articulation, the semantic web/linked data vision considers content as irrelevant, and even audiences as irrelevant.

These ideas, while still alive and championed by their supporters, have largely failed to live up to expectations.  There are many reasons for this failure, but a key one has been that proponents of linked data have failed to articulate its value to publishers and audiences. The goal of linked data always seems to be to feed more data to the machine.  Linked data discussions get trapped in the mechanics of what’s best for machines (dereferenceable URIs, machine values that mean nothing to humans), instead of what’s useful for people.

The emergence of schema.org (the structured data standard used in SEO) represents a step back from such machine-centric thinking, to accommodate at least some of the needs of human metadata creators by allowing text values. But schema.org still doesn’t offer much in the way of controlled vocabularies for values, which would be both machine-reliable and human-friendly.  It only offers a narrow list of specialized “enumerations”, some of which are not easy-to-read text values.

Schema.org has lots of potential, but its current capabilities get over-hyped by some in the SEO community.  Just as schema.org metadata should not be considered structured data, it is not really the semantic web either.  It’s unable to make inferences, which was a key promise of the semantic web.  Its limitations show why content remains important. Google’s answer to the problem of how to make structured data relevant to people was the rich snippet.  Rich snippets displayed in Google search results are essentially a vanity statement. Sometimes these snippets answer the question, but other times they simply tease the user with related information.  Publishers and audiences alike may enjoy seeing an extract of content in search results, and certainly rich snippets are a positive development in search. But displaying extracts of information does not represent an achievement of the power of data.  A list of answers supplied by rich snippets is far less definitive than a list of answers supplied by a conventional structured query database — an approach that has been around for over three decades.

The value of data comes from its capacity to aggregate, manipulate and compare information relating to many items.  Data can be impactful when arranged and processed in ways that change an audience’s perception and understanding of a topic. Genuine data provides values that can be counted and transformed, something that schema.org doesn’t support very robustly, as previously mentioned.  Google’s snippets, when parsing metadata values from articles, simply display fragments  from individual items of content.  A list of snippets doesn’t really federate information from multiple sources into a unified, consolidated answer.  If you ask Google what store sells the cheapest milk in your city, Google can’t directly answer that question, because that information is not available as data that can be compared.  Information retrieval (locating information) is not the same as data processing (consolidating information).
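The distinction between retrieval and processing can be sketched in a few lines of Python. The store names, prices, and the `retrieve` and `cheapest` helpers are all hypothetical, and the sketch assumes the prices have already been validated as comparable (same product, same unit, same currency), which is precisely what the passage above says is usually missing.

```python
from decimal import Decimal

# Hypothetical, already-validated offers in a single currency.
offers = [
    {"store": "Store A", "product": "milk 1L", "price": Decimal("1.29")},
    {"store": "Store B", "product": "milk 1L", "price": Decimal("1.09")},
    {"store": "Store C", "product": "milk 1L", "price": Decimal("1.19")},
    {"store": "Store B", "product": "bread", "price": Decimal("2.50")},
]


def retrieve(records, product):
    """Information retrieval: locate the matching items and return them as-is."""
    return [r for r in records if r["product"] == product]


def cheapest(records, product):
    """Data processing: consolidate the matching items into a single answer."""
    matches = retrieve(records, product)
    return min(matches, key=lambda r: r["price"]) if matches else None


print(len(retrieve(offers, "milk 1L")))      # 3
print(cheapest(offers, "milk 1L")["store"])  # Store B
```

A list of snippets resembles the output of `retrieve`: the reader still has to do the comparison. Only `cheapest` answers the question that was asked.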

“What is the point of all that data? A large data set is a product like any other. It must be maintained and updated, given attention. What are we to make of it?”  Paul Ford in “Usable Data”

But let’s assume that we do have solid data that machines can process without difficulty.  Can that data provide audiences with what they need?  Is content unnecessary when the data is machine quality?  Some evidence suggests that even the highest quality linked data isn’t sufficient to interest audiences.

The museum sector has been interested in linked data for many years.  Unlike most web publishers, they haven’t been guided by schema.org and Google.  They’ve been developing their own metadata standards.  Yet this project has had its problems.  The data lead of a well known art museum complained recently of the “fetishization of Linked Open Data (LOD)”.  Many museums approached data as something intrinsically valuable, without thinking through who would use the data, and why.  Museums reasoned that they have lots of great content (their collections) and that they needed to provide information about their collections online to everyone, so that linked data was the way to do that.  But the author notes: ‘“I can’t wait to see what people do with our data” is not a clear ROI.’  When data is considered as the goal, instead of as a means to a goal, then audiences get left out of the picture.  This situation is common to many linked data projects, where getting data into a linked data structure becomes an all-consuming end, without anchoring the project in audience and business needs.  For linked data to be useful, it needs to address specific use cases for people relying on the data.

Much magical thinking about linked data involves two assumptions: that the data will answer burning questions audiences have, and these answers will be sufficient to make explanatory content unnecessary.  When combined, these assumptions become one: everything you could possibly want to know is now available as a knowledge graph.

The promise that data can answer any question is animating development of knowledge graphs and “intelligent assistants” by nearly every big tech company: Google, Bing, LinkedIn, Apple, Facebook, etc.  This latest wave of data enthusiasm again raises questions about whether content is becoming less relevant.

Knowledge graphs are a special form of linked data.  Instead of the data living in many places, hosted by many different publishers, the data is instead consolidated into a single source curated by one firm, for example, Bing. A knowledge graph combines millions of facts about all kinds of things into a single data set. A knowledge graph creator generally relies on other publishers’ linked data. But it assumes responsibility for validating that data itself when incorporating the information in its knowledge graph.  In principle, the information is more reliable, both factually and technically.

Knowledge graphs work best for persistent data (the birth year of a celebrity) but less well for high velocity data that can change frequently (the humidity right now).   Knowledge graphs can be incredibly powerful.  They can allow people to find connections between pieces of data that might not seem related, but are.  Sometimes these connections are simply fun trivia (two famous people born in the same hospital on the same day). Other times these connections are significant as actionable information.  Because knowledge graphs hold so much potential, it is often difficult to know how they can be used effectively.   Many knowledge graph use cases relate to open ended exploration, instead of specific tasks that solve well defined user problems.   Few people can offer a succinct, universally relevant reply to the question: “What problem does a knowledge graph solve?” Most of the success I’ve seen for knowledge graphs has been in specialized vertical applications aimed at researchers, such as biomedical research or financial fraud investigations.  To be useful to general audiences, knowledge graphs require editorial decisions that queue up on-topic questions, and return information relevant to audience needs and interests.  Knowledge graphs are less useful when they simply provide a dump of information that’s related to a topic.
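The "same hospital, same day" connection can be sketched with a toy graph in Python. Facts are stored as (subject, predicate, object) triples, a common way to represent graph data; the people, hospitals, predicate names, and the `shared_birth` helper are hypothetical and do not reflect how any commercial knowledge graph is actually built.

```python
from collections import defaultdict

# Hypothetical facts as (subject, predicate, object) triples.
triples = [
    ("Person A", "born_on", "1970-05-01"),
    ("Person A", "born_at", "Hospital X"),
    ("Person B", "born_on", "1970-05-01"),
    ("Person B", "born_at", "Hospital X"),
    ("Person C", "born_on", "1970-05-01"),
    ("Person C", "born_at", "Hospital Y"),
]


def shared_birth(facts):
    """Group people by (hospital, date) and keep groups with more than one person."""
    profile = defaultdict(dict)
    for subject, predicate, obj in facts:
        profile[subject][predicate] = obj

    groups = defaultdict(list)
    for person, attrs in profile.items():
        if "born_at" in attrs and "born_on" in attrs:
            groups[(attrs["born_at"], attrs["born_on"])].append(person)

    return {key: sorted(people) for key, people in groups.items() if len(people) > 1}


print(shared_birth(triples))
# {('Hospital X', '1970-05-01'): ['Person A', 'Person B']}
```

The query is easy to write once someone has decided it is worth asking. That decision is the editorial work the paragraph above describes; the graph itself does not say which connections matter to an audience.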

Knowledge graphs combine aspects of Wikipedia (the crowdsourcing of data) with aspects of a proprietary gatekeeping platform such as Facebook (the centralized control of access to and prioritization of information).  No one party can be expected to develop all the data needed in a knowledge graph, yet one party needs to own the graph to make it work consistently — something that doesn’t always happen with linked data.   The host of the knowledge graph enjoys a privileged position: others must supply data, but have no guarantee of what they receive in return.

Under this arrangement, suppliers of data to a knowledge graph can’t calculate their ROI. Publishers are back in the situation where they must take a leap of faith that they’ll benefit from their effort.  Publishers are asked to supply data to a service on the basis of a vague promise that the service will provide their customers with helpful answers.  Exactly how the service will use the data is often not transparent. Knowledge graphs don’t reveal what data gets used, and when.   Publishers also know their rivals are supplying data to the same graph.  The faith-based approach to developing data, in hopes that it will be used, has a poor track record.

The context of data retrieved from a knowledge graph may not be clear.  Google, Siri, Cortana, or Alexa may provide an answer.  But on what basis do they make that judgment?  The need for context to understand the meaning of data leads us back to content.   What a fact means may not be self-evident. Even facts that seem straightforward can depend on qualified definitions.

“A dataset precise enough for one purpose may not be sufficiently precise for another. Data on the Web may be wrong, or wrong in some context—with or without intent.” Bernstein, Hendler & Noy

The interaction between content and data is becoming even more consequential as the tech industry promotes services incorporating artificial intelligence.  In his book Free Speech, Timothy Garton Ash shared his experience using WolframAlpha, a semantic AI platform that competes with IBM Watson, and that boldly claims to make the “world’s knowledge computable.”  When Garton Ash asked WolframAlpha “How free should speech be?”, it replied: “WolframAlpha doesn’t understand your query.”   This kind of result is entirely expected, but it is worth exploring why something billed as being smart fails to understand.  Conversational interfaces, after all, are promising to answer our questions.  Data needs to exist for questions to get answers.  For data to operate independently of content, an answer must be expressible as data. But many answers can’t be reduced to one or two values.  Sometimes they involve many values.  Sometimes answers can’t be expressed as a data value at all. This actuality means that content will always be necessary for some answers.

Data as a Bridge to Content

Data and content have different temperaments.  The role of content is often to lead the audience to reveal what’s interesting.  The role of data is frequently to follow the audience as they indicate their interests. Content and data play complementary roles.  Each can be incomplete without the other.

Content, whether articles, video or audio, is typically linear.  Content is meant to be consumed in a prescribed order.   Stories have beginnings and ends, and procedures normally have fixed sequences of steps.  Hyperlinking content provides a partial solution to making a content experience less linear, when that is desired.  Linear experiences can be helpful when audiences need orientation, but they are constraining when such orientation isn’t necessary.

Data, to be seen, must first be selected. Publishers must select what data to highlight, or they must delegate that task to the audience. Data is non-linear: it can be approached in any order.  It can be highly interactive, providing audiences with the ability to navigate and explore the information in any order, and change the focus of the information.  With that freedom comes the possibility that audiences get lost, unable to identify information of value.  What data means is highly dependent on the audience’s previous understanding.  Data can be explained with other data, but even these explanations require prior  knowledge.

From an audience perspective, data plays various roles.  Sometimes data is an answer, and the end of a task.  Sometimes data is the start of a larger activity.  Data is sometimes a signal that a topic should be looked at more closely.  Few people decide to see a movie based on an average rating alone.  A high rating might prompt someone to read about the film.  Or the person may already be interested in reading about the film, and consults the average rating simply to confirm their own expectation of whether they’d like it.  Data can be an entryway into a topic, and a point of comparison for audiences.

Writers can undervalue data because they want to start with the story they wish to tell, rather than the question or fact that prompts initial interest from the audience.   Audiences often begin exploration by seeking out a fact. But what that fact may be will be different according to each individual.  Content needs facts to be discovered.

Data evangelists can undervalue content because they focus on the simple use cases, and ignore the messier ones.  Data can answer questions only in some situations.  In an ideal world, a list of questions and answers gets paired together as data. Just match the right data with the right question.  But audiences may find it difficult to articulate the right question, or they may not know what question to ask. Audiences may find they need to ask many specific questions to develop a broad understanding.  They may find the process of asking questions exhausting.  Search engines and intelligent agents aren’t going to Socratically enlighten us about new or unfamiliar topics.  Content is needed.

Ultimately, whether data or content is most important depends on how much communication is needed to support the audience.  Data supplies answers, but doesn’t communicate ideas.  Content communicates ideas, but can fail to answer if it lacks specific details (data) that audiences expect.

No bold line divides data from content.  Even basic information, such as expressing how to do something, can be approached either episodically as content, or atomically as data.  Publishers can present the minimal facts necessary to perform a task (the must do’s), or they can provide a story about possibilities of tasks to do (the may do’s).  How should they make that decision?

In my experience, publishers rarely create two radically alternative versions of online information, a data-centric and content-centric version, and test these against each other to see which better meets audience needs.  Such an approach could help publishers understand what the balance between content and data needs to be.  It could help them understand how much communication is required, so the information they provide is never in the way of the audience’s goals.

— Michael Andrews

Content Strategy Innovation: Emerging Practices

What new practices will forward-looking publishers start to implement in the next few years? Digital content is in a constant state of change. Are current practices up to the task?

Various professions are actively developing new computer-based practices to address high volume content. Journalists are under pressure to produce greater quantities of content with fewer resources, and to make this content even more relevant. Organizations focused on vast quantities of historical content, such as museums and scholars, have been developing new approaches to extract value from all this material. These practices may not be ones that content strategists are familiar with, but they should be.

Ten years ago, online content was largely about web pages. Today it includes mobile apps, tablets, even self-published ebooks that live in the cloud, and new channels are around the corner. Even though we now accept that the channels for content are always changing, we still consider content as primarily the responsibility of an individual author. We should expand our thinking to include ways to use computer-augmented authoring and analysis.

It may seem hasty to talk about new practices, when many organizations struggle to implement proven good practices. As powerful as current content strategy practices are, they do not address many important issues organizations face with their content. It’s essential to develop new practices, not just advocate well-established ones. It is complacent to dismiss change by believing that the future cannot be predicted, so we can worry about it when it arrives.

We shouldn’t be defined by current tools and short-term thinking. As Jonathon Colman recently wrote in CCO magazine: “What I fear about our future, however, is that we get so caught up with the technologies, tools and tactics of our trade that we reassign our thinking from the long term to the short. We start thinking and strategizing in ever shorter cycles: months instead of years, campaigns instead of life cycles, individual infographics instead of brands they represent.”

Fortunately, content strategy can draw upon the deep experience of other disciplines concerned with content. To quote William Gibson: “The future is already here — it’s just not very evenly distributed.” I want to highlight some promising approaches being developed by colleagues in other fields.

The pressures for innovation

The pressures for innovation in content strategy come from audiences, and from within organizations. Audience expectations show no sign of diminishing — consumers everywhere are becoming more demanding. They are fickle and individualistic, and don’t want canned servings of content. They desire diversity in content at the same time they complain of too much information (TMI). They expect personalization but don’t want to relinquish control. They want to be enthusiastic about what they view, but can easily react with skepticism and impatience.

Organizations of all kinds are struggling to get their content affairs in order. They are trying to bring process and predictability to the creation and delivery of their content. Much of this effort focuses on people and processes. But approaches that are primarily labor-intensive will not ultimately provide the capability to satisfy escalating customer demands.

Future ready: beyond structure and modularity

Content strategy recommends being future-ready. Generally this means applying structure and modularity to one’s content, so it can be ready for whatever new channel emerges. While these concepts are still not widely implemented, the concepts themselves are already old, having been a recommended best practice since the early 2000s (see for example, the first edition of Ann Rockley’s Managing Enterprise Content, published in 2002). Adoption of structure and modularity has been slow to take hold due to the immaturity of standards and tools. But it does seem that structure and modularity are now crossing the chasm from being a specialized technical communications practice towards mainstream acceptance. While it can be easy to become preoccupied by the implementation of current practices, content strategy shouldn’t stop thinking about what new practices are needed.

The content must be future-ready: able to adapt to future requirements. Equally importantly, one’s content strategy must be future-forward, anticipating these requirements, not just reacting to them. When discussing the value of “intelligent content,” the content strategy discipline has largely focused on one part of the equation: the markup of content, and how rules should govern what content is displayed. It has generally avoided more algorithmic issues. To realize the full possibilities of intelligent content, content strategy will need to move beyond markup and into the areas of queries and text and data analysis. These areas are rich with possibilities to add value for audiences, and enable brands to offer better experiences.

Emerging practices

Content strategy can learn much from other content-intensive professions, especially developments coming from certain areas of journalism (data journalism and algorithmic journalism), the cultural sector (known as GLAMs), and computer-oriented humanities research (digital humanities).

These disciplines offer four approaches that could help various organizations with their content strategy:

  1. Data as Content
  2. Bespoke content
  3. Semantic curation
  4. Awareness of meaning

Data as Content

Savvy journalists are aware that there can be engaging stories hidden in data. Data is solid and concrete compared to anecdotes. Data can be visual and interactive. Data is happening all the time: the story it tells is alive, always changing. The fascination of data is evident in the growing trend to monitor and track one’s own data: the so-called quantified self. We gain a perspective on our exercise or eating we might not otherwise see. The possibility for content strategy is to look not just at “me data” but also “we data”: data about our community. There are numerous quality of life indicators relating to communities we identify with. We already track data about communities of interest: the performance of our favorite sports team, or the rankings of the university we attended. But data can provide stories about much more.

Data journalists think about sources of data as potential story material. How do the property values of our local neighborhood compare with other neighborhoods? If you adjust these findings for the quality of schools, or average commute time, how do they compare then? Journalists curate interesting data, and think of ways to present it that are interesting to audiences. Audiences can query the data to find exactly what interests them.

Brands can adopt the techniques of data journalism, and use data as the basis of content. Brands can tell the story of you, the customer. For example, looking at their data, what do they notice about changes in customer needs and preferences? People are often interested in how their perspectives and behavior compare with others. They want insights into emerging trends. By offering visual data that can be explored thematically, brands can help customers understand more, and deepen their relationship to the brand. The aggregation of different kinds of customer data (even what colors are most popular in what parts of the country) can provide an interesting way to tie together an egocentric angle (reader as protagonist) with a brand-centric story (what the brand does to serve the customer). Data about such attributes can humanize activities that might otherwise appear opaque.

I can imagine data storytelling being used in B2B content marketing, where demonstrating engagement is a pressing need. There are opportunities to provide customers with useful insights, by sharing data about order and servicing trends for product categories. Providing data about the sentiment of fellow customers can strengthen one’s identification as a customer of the brand. Obviously this information would need to be anonymized, and not disclose proprietary data.

Bespoke Content

Bespoke content represents the ultimate goal of personalization. It is content made to order: for a person, or to fit a specific moment in time. The tools to create bespoke content are emerging from another area of journalism: robot journalism.

In robot journalism, software takes on writing tasks. Where data journalism uses data to tell stories with interactive charts and tables, robot journalism writes stories algorithmically from data. The notion that computers might write content may be hard to accept. Many content strategists come from a background in writing, and may equate writing quality with writing style. But when we view writing through the lens of audience value, relevance is the most important factor. Robot journalism can provide highly customized and personalized content.

Organizations such as the Associated Press are using robot journalism to write brief stories about sports, weather and financial news.

The process behind algorithmic writing involves:

  1. Take in data related to a topic
  2. Compute what is “newsworthy” about that data
  3. Decide how to characterize the significance of an event
  4. Place event in context of specific interests of an audience segment
  5. Convert information into narrative text

Good candidates for robot journalism are topics involving status-based, customer-specific information that is best presented in a narrative form.  A simple example of an algorithmically authored narrative using customer and brand data might be as follows:
“Your [car model] was last serviced on [date] by [dealer]. Driving in your region involves higher than average [behavior: e.g., stop and go traffic] that can accelerate wear on [function: e.g., brakes]. According to your driving history, we recommend you service [function] by [this date]. It will cost [$]. Available times are: [dates] at [nearest location].”

Although conditional content has been used in DITA-described technical communications for some time, robot journalism takes conditional content a couple steps further by incorporating live data, and by auto-creating the sentence clauses used in narrative descriptions, rather than simply substituting a limited number of text variables such as a product model name.
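
To make the difference from simple variable substitution concrete, here is a minimal sketch in Python of the service-reminder narrative above. Everything in it is an assumption made for illustration: the ServiceRecord fields, the 1.2 threshold for "higher than average," and the shortened service interval are all hypothetical, not taken from any real system. The point is only that the software decides which clauses belong in the narrative based on the data, rather than filling fixed slots.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ServiceRecord:
    """Hypothetical customer and brand data for one vehicle."""
    car_model: str
    last_service: date
    dealer: str
    stop_and_go_share: float  # share of this customer's trips in stop-and-go traffic
    regional_average: float   # the same share averaged across the region


def is_heavy_wear(record, threshold=1.2):
    """Decide what is 'newsworthy': driving well above the regional norm."""
    return record.stop_and_go_share > record.regional_average * threshold


def recommended_date(record, base_interval_days=365):
    """Shorten the service interval when the data suggests faster wear."""
    if is_heavy_wear(record):
        interval = int(base_interval_days * 0.75)
    else:
        interval = base_interval_days
    return record.last_service + timedelta(days=interval)


def write_narrative(record):
    """Convert the data into narrative text, adding clauses only when relevant."""
    sentences = [
        f"Your {record.car_model} was last serviced on "
        f"{record.last_service.isoformat()} by {record.dealer}."
    ]
    if is_heavy_wear(record):
        sentences.append(
            "Your driving involves more stop-and-go traffic than is typical "
            "for your region, which can accelerate wear on brakes."
        )
    sentences.append(
        f"We recommend you service your brakes by "
        f"{recommended_date(record).isoformat()}."
    )
    return " ".join(sentences)


record = ServiceRecord("hatchback", date(2024, 1, 10), "Main Street Motors", 0.5, 0.3)
print(write_narrative(record))
```

A customer whose driving matches the regional average would get a shorter narrative with a later date, because the wear clause is never generated for them.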

The approach can also be used for micro-segments, such as product loyalists who have bought three or more of a product over the past twelve months. A short narrative could be constructed to share the significance of something newsworthy relating to the product. A wine enthusiast might get a short narrative forecasting the quality of the newest vintage for a region she enjoys wine from.

Writing such bespoke narratives manually would be prohibitively expensive. Robot journalism approaches will enable brands to offer customized and personalized narrative content in a cost-effective way and at a large scale.

Semantic Curation

Today multiple issues hinder content curation. Some curation is done well, but is labor intensive, so is done on a limited scale that only touches a small portion of content. Attempts to automate curation are often clumsy. Much curation today is reactive to popularity, rather than choosing what’s significant in some specific way. We end up with lists of “top,” “favorite” or “trending” items that don’t have much meaning to audiences: they seem rather arbitrary, and are often predictable.

True curation aids discovery of content not known to a reader that reflects their individual interests. Semantic curation empowers individuals to find the best content that matches their interests. By semantic, I mean using linked data. And leading the way in developing semantic curation is a community with deep experience in curation: galleries, libraries, archives, and museums (GLAM).

GLAMs have been pioneers developing metadata, and as a result, have been some of the first to experience the pain of locked up metadata. Despite the richness of their descriptions of content, these descriptions didn’t match the descriptions developed by others. It is hard to pair together the content from different sources when their metadata descriptions don’t match. So GLAMs have turned to linked open data to describe their content. It is opening up a new world of curation.

The development of open cultural data is a significant departure from proprietary formats for metadata. When all cultural institutions describe their content holdings in the same way, it becomes possible to find connections between related items that are in different places. For GLAMs, it is opening access to digital collections. For audiences, it enables bottom-up curation. Individuals can express what kind of content they are interested in, and find this content regardless of what source has the content. Unlike with a search engine, the seeker of content can be very specific. They may seek paintings by artists from a certain country who depicted women during a certain time period. No matter what physical collection such paintings belong to, the content seeker can access the content. They can access any content, not just a small set of content selected by a curator.
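
The painting query described above can be sketched in a few lines of Python. This is an illustration only: real linked open data would use published vocabularies and a query language such as SPARQL, whereas here the statements are plain (subject, predicate, object) tuples, and every identifier, predicate name, and year is made up. What it shows is that once different institutions describe their holdings with the same predicates, one query can span all of them.

```python
from collections import defaultdict

# Hypothetical statements from three institutions, all using the same predicates.
TRIPLES = [
    ("museum_a:work1", "type", "Painting"),
    ("museum_a:work1", "creator", "artist:x"),
    ("museum_a:work1", "depicts", "woman"),
    ("museum_a:work1", "year", 1885),
    ("archive_b:work7", "type", "Painting"),
    ("archive_b:work7", "creator", "artist:y"),
    ("archive_b:work7", "depicts", "woman"),
    ("archive_b:work7", "year", 1892),
    ("archive_b:work8", "type", "Painting"),
    ("archive_b:work8", "creator", "artist:y"),
    ("archive_b:work8", "depicts", "woman"),
    ("archive_b:work8", "year", 1925),
    ("gallery_c:work3", "type", "Painting"),
    ("gallery_c:work3", "creator", "artist:x"),
    ("gallery_c:work3", "depicts", "landscape"),
    ("gallery_c:work3", "year", 1888),
    ("museum_a:work9", "type", "Painting"),
    ("museum_a:work9", "creator", "artist:z"),
    ("museum_a:work9", "depicts", "woman"),
    ("museum_a:work9", "year", 1890),
    ("artist:x", "nationality", "France"),
    ("artist:y", "nationality", "France"),
    ("artist:z", "nationality", "Italy"),
]


def build_index(triples):
    """Map each (subject, predicate) pair to the set of its objects."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].add(obj)
    return index


def find_paintings(triples, country, depicts, start_year, end_year):
    """Return paintings matching all three criteria, whatever collection holds them."""
    index = build_index(triples)
    paintings = {s for s, p, o in triples if p == "type" and o == "Painting"}
    matches = []
    for work in sorted(paintings):
        artists = index[(work, "creator")]
        from_country = any(country in index[(a, "nationality")] for a in artists)
        shows_subject = depicts in index[(work, "depicts")]
        in_period = any(start_year <= y <= end_year for y in index[(work, "year")])
        if from_country and shows_subject and in_period:
            matches.append(work)
    return matches


print(find_paintings(TRIPLES, "France", "woman", 1880, 1900))
```

The two matching works sit in different collections, yet the seeker never has to know or care which institution holds them.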

The potential to expand such interest-driven, bottom-up curation beyond the cultural sector is enormous. While the work involved in creating open metadata standards is far from trivial, significant progress is being achieved to describe all kinds of content in a linked manner. The BBC has been exemplary in providing content curated using linked data on topics from animals to sports.

Awareness of Meaning

Content analytics today are not very smart. They show activity, but tell us little about the meaning of content. We can track content by the section on a website where it appears, the broad topic it is classified under, or perhaps the page title, but not by what specifically is discussed in an article. When we don’t understand what our content is actually about, what it says specifically, it is hard to know how it is performing.

This problem is well known to people working with social media content. It helps little to know that people are discussing an article. It is far more important to know what precisely they are saying about it.

As Hemann and Burbary note in their recent book, Digital Marketing Analytics: “There is not currently any pieces of marketing analytics software that can do as good job as a human at… classifying the social data collected into meaningful information.” People must manually apply tags to social content in their social listening tool for later analysis. This is labor intensive, and often means that only some of the content gets analyzed. The problem is largely the same for brand-created content: CMSs don’t generate tags automatically based on the meaning of the text, so tagging must be done manually, and is often not very specific.

Again, the innovation is coming from outside the disciplines of content management and marketing. Scholars working in the field of digital humanities (DH) have been working on ways to query and tag large bodies of textual content to enable deeper analysis. Some of the techniques are quite sophisticated, and rely on widely available open source tools. It is surprising these techniques haven’t been applied more frequently to consumer content.

DH techniques examine large sets of digital content to learn what these sets are about, without actually reading the content. Perhaps the most famous example of such techniques is Google’s Ngram Viewer, which can find the frequency of different phrases over time in books to learn what idioms are popular, or how famous different people are over time. (You can learn about the origins and applications of Ngram Viewer in the book Uncharted.)
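
The underlying idea of tracking a phrase over time is simple enough to sketch in Python. To be clear, this is not how Ngram Viewer is implemented; it is a toy version under my own assumptions: documents arrive as (year, text) pairs, tokenization is a crude regular expression, and frequency means the share of all same-length word sequences published that year.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Split text into lowercase words and return every run of n consecutive words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def phrase_frequency_by_year(documents, phrase):
    """Return {year: relative frequency of phrase} for (year, text) documents."""
    n = len(phrase.split())
    target = phrase.lower()
    hits = Counter()
    totals = Counter()
    for year, text in documents:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Made-up sample documents, for illustration only.
documents = [
    (1900, "The horseless carriage is new"),
    (1900, "A horseless carriage passed"),
    (1950, "The motor car is common"),
]
print(phrase_frequency_by_year(documents, "horseless carriage"))
```

Run against a brand's own archive, the same counting could show when a product name or a piece of jargon entered, or left, the brand's vocabulary.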

Employing diverse methods, the techniques are often referred to as text analytics. Two leading approaches to text analytics are topic modeling and corpus linguistics. Topic modeling allows users to find themes in large bodies of text, by identifying key nouns that, when discussed together, signal the presence of a specific topic. Corpus linguistics can identify phrases that are significant because they are used more frequently than would be expected.
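
The corpus linguistics idea of "more frequently than would be expected" can be illustrated with a rough Python sketch. It compares how often a word appears in the text being studied with how often it appears in a reference text, and ranks words by the ratio. The scoring here is a deliberately crude assumption of mine (a simple rate ratio with add-one smoothing so unseen words don't cause division by zero); established tools use more careful statistics.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def keyness(target_text, reference_text, min_count=2):
    """Rank words in target_text by how much more often they occur than in reference_text."""
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    target_total = sum(target.values())
    reference_total = sum(reference.values())
    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # ignore words too rare to judge
        target_rate = count / target_total
        # Add-one smoothing: a crude stand-in for "expected" frequency.
        reference_rate = (reference[word] + 1) / (reference_total + 1)
        scores[word] = target_rate / reference_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up sample texts, for illustration only.
brand_text = "brakes wear brakes service the the"
general_text = "the weather the news the sports the the"
for word, score in keyness(brand_text, general_text):
    print(word, round(score, 2))
```

Words scoring well above 1 are distinctive of the text being studied; common words such as "the" fall below 1 because they are just as frequent, or more so, everywhere else.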

Text analytics can be useful for many content activities. It can be used in content auditing, to learn what specific topics a brand has been publishing about, or to learn more about how the brand’s voice is appearing in the actual content. These same approaches can be used for social media analysis. Topic modeling also can be used to auto-categorize content for audiences, to provide audiences with richer and more detailed navigation.

A complex machine is not necessarily an intelligent one.  (author photo)

The Opportunities Ahead

This quick tour of emerging practices suggests that it is possible to apply a more algorithmic approach to content to improve the audience experience. Unfortunately, I see few signs that CMS vendors are focused on these opportunities. They seem beholden to the existing paradigm of content management, where individual writers are responsible for curating, tagging and producing nearly all content. It’s an approach that doesn’t scale readily, and severely limits an organization’s capacity to deliver content that’s tailored to the interests of audiences.

It is a mistake to assume that greater use of technology necessarily results in greater complexity for authors. Some new practices need to be performed by specialists, rather than foisted on non-specialist authors who already are busy. When implemented properly, with a user-centric design, new practices should reduce the amount of manual labor required of authors, so they can focus on the creative aspects of content that machines are not able to do. As the value of content becomes understood, organizations will realize they face a productivity bottleneck, where it becomes difficult to deliver the sophisticated content they aspire to with existing staff levels. The most successful publishers will be ones that adopt new practices that deliver more value without needing to add to their headcount.

Noz Urbina notes the importance of planning for change early if organizations hope to adapt to market changes. “I fear communicators are in a vicious cycle today. As the change in our market accelerates, the longer we avoid taking on revolutionary changes in search of simple short-term incremental changes, the bigger our long-term risk. Short term simple can be medium-long term awful. The risk increases with every delay that in 2 years’ time, management or the market will push us to deliver something in a matter of months that would have needed a 3-7 year transition process to prepare for. This is a current reality for many organisations for whom I have worked.”

The best approach is to learn about practices that are on the horizon, and to think about how they might be useful to your organization. Consider a small-scale project to experiment and pilot an approach to learn more about what’s involved, and what benefits it might offer. Very small teams are doing many interesting content innovations, often as a side project.

—Michael Andrews