Tag: metadata

Tailless Content Management

Post author By Michael Andrews
Post date April 17, 2019

There’s an approach to content management that is being used, but doesn’t seem to have a name. Because it lacks a name, it doesn’t get much attention. I’m calling this approach tailless content management, in contrast to headless content management. The tailless approach and the headless approach are trying to solve different problems.

What Headless Doesn’t Do

Discussion of content management these days is dominated by headless CMSs. A crop of new companies offer headless solutions, and legacy CMS vendors are also singing the praises of headless. Sitecore says: “Headless CMSs mean marketers and developers can build amazing content today, and—importantly—future-proof their content operation to deliver consistently great content everywhere.”

In simple terms, a headless CMS strips functionality relating to how web pages are presented and delivered to audiences. It’s supposed to let publishers focus on what the content says, rather than what it looks like when delivered. Headless CMS is one of several trends to unbundle functionality customarily associated with CMSs. Another trend is moving the authoring and workflow functionality into a separate application that is friendlier to use. CMS vendors have long touted that their products can do everything needed to manage the publication of content. But increasingly content authors and designers are deciding that vendor choices are restricting, rather than helpful. CMSs have been too greedy in making decisions about how content gets managed.

“Future-proof” headless CMSs may seem like the final chapter in the evolution of the CMS. But even headless CMSs can still be very rigid in how they handle content elements. Many are based on the same technology stack (LAMP) that’s obliquely been causing problems for publishers over the past two decades. In nearly every CMS, all audience-facing factual information needs to be described as a field that’s attached to a specific content type. The CMS may allow some degree of content structuring, and the ability to mix different fragments of content in different ways. But they don’t solve important problems that complex publishers face: the ability to select and optimize alternative content-variables, to use data-variables across different content, and to create dynamic content-variables incorporating data-variables. To my mind, those three dimensions are the foundation for what a general-purpose approach to content engineering must offer. Headless solutions relegate the CMS to being an administrative interface for the content. The CMS is a destination to enter text. But it often does a poor job supporting editorial decisions, and giving publishers true flexibility. The CMS design imposes restrictions on how content is constructed.

Since the CMS no longer worries about the “head”, headless solutions help publishers focus on the body. But the solution doesn’t help publishers deal with a neglected aspect: the content’s tail.

Content’s ‘Tail’

Humans are one of the few animals without tails. Perhaps that’s why we don’t tend to talk about the tail as it relates to content. We sometimes talk about the “long tail” of information people are looking for. That’s about as close as most discussions get to considering the granular details that appear within content. The long tail is a statistical metaphor, not a zoological one.

Let’s think about content management as having three aspects: the head at the top (and which is top of mind for most content creators), the body in the middle (which has received more attention lately), and the tail at the end, which few people think much about.

The head/body distinction in content is well-established. The metaphor needs to be extended to include the notion of a tail. Let’s break down the metaphor:

The head: the face of the content, as presented to audiences.

The body: the organs (components) of the content. Like the components of the human body (heart, lungs, stomach, etc.), the components within the body of content should each have a particular function to play.

The tail: the details in the content (mnemonic: deTails). The tail provides stability, keeping the body in balance.

In animals, tails play an important role negotiating with their surroundings. Tails offer balance. They swat flies. They can grab branches to steady oneself. Tails help the body adjust to the environment. To do this, tails need to be flexible.

Details can be the most important part of content, just as the tails of some animals are the main event. In a park a kilometer from my home in central India, I can watch dozens of peacocks, India’s national bird. Peacocks show us that tails are not minor details.

When the tail is treated as a secondary aspect of the body, its role gets diminished. Publishers need to treat data as being just as important as content in the body. Content management needs to consider both customer-facing data and narrative content as distinct but equally important dimensions. Data should not be a mere appendage to content. Data has value in its own right.

With tailless content management, customer-facing data is stored separately from the content using the data.

The Body and the Details

The distinction between content and data, and between the body and the detail, can be hard to grasp. The architecture of most CMSs doesn’t make this distinction, so the difference doesn’t seem to exist.

CMSs typically structure content around database fields. Each field has a label and an associated value. Everything that the CMS application needs to know gets stored in this database. This model emerged when developers realized that HTML pages had regular features and structures, such as having titles and so on. Databases made managing repetitive elements much easier compared to creating each HTML page individually.

The problem is that a single database is trying to do many different things at once. It can be:

Holding long “rich” texts that are in the body of an article

Holding many internally-used administrative details relating to articles, such as who last revised an article

Holding certain audience-facing data, such as the membership services contact telephone number and dates for events

These fields have different roles, and look and behave differently. Throwing them together in a single database creates complexity. Because of the complexity, developers are reluctant to add additional structure to how content is managed. Authors and publishers are told they need to be flexible about what they want, because the central relational database can’t be flexible. What the CMS offers should be good enough for most people. After all, all CMSs look and behave the same, so it’s inevitable that content management works this way.

Something perverse happens in this arrangement. Instead of the publisher structuring the content so it will meet the publisher’s needs, the CMS’s design ends up making decisions about if and how content can be structured.

Most CMSs are attached to a relational database such as MySQL. These databases are a “kitchen sink” holding any material that the CMS may need to perform its tasks.

To a CMS, everything is a field. CMSs don’t distinguish long text fields that contain paragraphs or narrative content with limited reuse (such as a teaser or the article body) from data fields with simple values that are relevant across different content items and even outside of the content. CMSs mix narrative content, administrative data, and editorial data all together.

A CMS database holds administrative profile information related to each content item (IDs, creation dates, topic tags, etc). The same database is also storing other non-customer facing information that’s more generally administrative, such as roles and permissions. In addition to the narrative content and the administrative profile information, the CMS stores customer-facing data that’s not necessarily linked to specific content items. This is information about entities such as products, addresses of offices, event schedules and other details that can be used in many different content items. Even though entity-focused data can be useful for many kinds of content, these details are often fields of specific content types.

The design of CMSs reflects various assumptions and priorities. While everything is a field, some fields are more important than others. CMSs are optimized to store text, not to store data. The backend uses a relational database, but it mostly serves as a content repository.

Everyday Problems with the Status Quo

Content discusses entities. Those entities involve facts, which are data. These facts should be described with metadata, though they frequently are not.

A longstanding problem publishers face is that important facts are trapped within paragraphs of content that they create and publish. When the facts change, they are forced to manually revise all the writing that mentions these facts. Structuring content into chunks does not solve the problem of making changes within sentences. Often, factual information is mentioned within unique texts written by various authors, rather than within a single module that is centrally managed.

Most CMSs don’t support the ability to change information about an entity so that all paragraphs will update that information.

Let’s consider an example of a scenario that can be anticipated ahead of time. A number of paragraphs in different content items mention an application deadline date. The procedure for applying stays the same every year, but the exact date by which someone must apply will change each year. The application deadline is mentioned by different writers in different kinds of content: various announcement pages, blog posts, reminder emails, etc. In most CMSs today, the author will need to update each unique paragraph where the application is mentioned. They don’t have the ability to update each mention of the application date from one place.

Other facts can change, even if not predictably. Your community organization has for years staged important events in the Jubilee Auditorium at your headquarters. Lots of content talks about the Jubilee Auditorium. But suddenly a rich donor has decided to give your organization some money. To honor the donation, your organization decides to rename Jubilee Auditorium to the Ronald L Plutocrat Auditorium. After the excitement dies down, you realize that more than the auditorium plaque needs to change. All kinds of mentions of the auditorium are scattered throughout your online content.

These examples are inspired by real-life publishing situations.

Separating Concerns: Data and Content

Contrary to the view of some developers, I believe that content and data are different things, and need to be separated.

Content is more like computer code than it’s like data. Like computer code, content is about language and expression. Data is easy to compare and aggregate. Its values are tidy and predictable. Content is difficult to compare: it must be diff’d. Content can’t easily be aggregated, since most items of content are unique.

Each chunk of content is code that will be read by a browser. The body must indicate what text gets emphasis, what text has links, and what text is a list. Content is not like data generally stored in databases. It is unpredictable. It doesn’t evaluate to standard data types. Within a database, content can look like a messy glob that happens to have a field name attached to it.

The scripts that a CMS uses must manipulate this messy glob by evaluating each letter character-by-character. All kinds of meaning are embedded within a content chunk, and some of it is hard to access.

The notion that content is just another form of data that can be stored and managed in a relational database with other data is the original sin of content management.

It’s considered good practice for developers to separate their data from their code. Developers though have a habit of co-mingling the two, which is why new software releases can be difficult to upgrade, and why moving between software applications is hard to do.

The inventor of the World Wide Web, Tim Berners-Lee, has lately been talking about the importance of separating data from code, “turning the way the web works upside-down.” He says: “It’s about separating the apps from the data.”

In a similar vein, content management needs to separate data from content.

Data Needs Independence

We need to fix the problem with the design of most CMSs, where the tail of data is fused together to the spine of the body. This makes the tail inflexible. The tail is dragged along with the body, instead of wagging on its own.

Data needs to become independent of specific content, so that it can be used flexibly. Customer-facing data needs to be stored separately from the content that customers view. There are many reasons why this is a good practice. And the good news is it’s been done already.

Separating factual data from content is not a new concept. Many large e-commerce websites have a separate database with all their product details that populates templates that are handled by a CMS. But this kind of use of specialized backend databases is limited in what it seeks to achieve. The external database may serve a single purpose: to populate tables within templates. Because most publishers don’t see themselves as data-driven publishers the way big e-commerce platforms are, they may not see the value of having a separate dedicated backend database.

Fortunately there’s a newer paradigm for storing data that is much more valuable. What’s different in the new vision is that data is defined as entity-based information, described with metadata standards.

The most familiar example of how an independent data store works with content is Wikipedia. The content we view on Wikipedia is updated by data stored in a separate repository called Wikidata. The relationship between Wikipedia and Wikidata is bidirectional. Articles mention factual information, which gets included in Wikidata. Other articles that mention the same information can draw on the Wikidata to populate the information within articles.

Facts are generally identified with a QID. The identifier Q95 represents Google. Google is a data variable. Depending on the context, Google can be referred to as Google Inc. (a joint-stock company until 2017) or Google LLC (a limited liability company beginning in 2017). As a data value, the company name can adjust over time. Editors can also change the value when appropriate. Google became a subsidiary of Alphabet Inc. (Q20800404) in 2015. Some content, such as content relating to financial performance, will address that entity starting in 2015. Like many entities, companies change names and statuses over time.

How Wikipedia accesses Wikidata. Source: Wikidata
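
The Google example can be sketched in a few lines of Python. This is only an illustration of a name that varies over time, not a description of how Wikidata stores or serves its statements. The `NAME_HISTORY` table, the `name_for_year` function, and the choice to treat 2017 as the first year of “Google LLC” are assumptions made for the example.

```python
from typing import Optional

# Hypothetical, hand-written name history keyed by Wikidata QID.
# Each entry is (first_year, last_year, name); None means open-ended.
# Assumption: 2017 is treated as the first year of "Google LLC".
NAME_HISTORY = {
    "Q95": [
        (None, 2016, "Google Inc."),
        (2017, None, "Google LLC"),
    ],
}


def name_for_year(qid: str, year: int) -> Optional[str]:
    """Return the name an entity went by in a given year, if known."""
    for first, last, name in NAME_HISTORY.get(qid, []):
        starts_ok = first is None or year >= first
        ends_ok = last is None or year <= last
        if starts_ok and ends_ok:
            return name
    return None


print(name_for_year("Q95", 2012))  # Google Inc.
print(name_for_year("Q95", 2019))  # Google LLC
```

Content that refers to Q95 rather than to a literal company name can then show whichever name fits the period it discusses.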

As an independent store of data, Wikidata supports a wide variety of articles, not just one content type. But its value extends beyond its support for Wikipedia articles. Wikidata is used by many other third party platforms to supply information. These include Google, Amazon’s Alexa, and the websites of various museums.

While few publishers operate at the scale of Wikipedia, the benefits of separating data from content can be realized on a small scale as well. An example is offered by the popular static website generator package called Jekyll, which is used by GitHub, Shopify, and other publishers. A plugin for Jekyll lets publishers store their data in the RDF format, a standard that offers significant flexibility. The data can be inserted into web content, but is in a format where it can also be available for access by other platforms.

Making the Tail Flexible

Data needs to be used within different types of content, and across different channels — including channels not directly controlled by the publisher.

The CMS-centric approach, tethered to a relational database, tries to solve these issues by using APIs. Unfortunately, headless CMS vendors have interpreted the mantra of “create once, publish everywhere” to mean “enter all your digital information in our system, and the world will come to you, because we offer an API.”

Audiences need to know simple facts, such as what’s the telephone number for member services, in the case of a membership organization. They may need to see that information within an article discussing a topic, or they may want to ask Google to tell them while they are making online payments. Such data doesn’t fit comfortably into a specific structured content type. It’s too granular. One could put it into a larger contact details content type, but that would include lots of other information that’s not immediately relevant. Chunks of content, unlike data, are difficult to reuse in different scenarios. Content types, by design, are aligned with specific kinds of scenarios. But defined content structures used to build content types are clumsy at supporting general purpose queries or cross-functional uses. And it wouldn’t help much to make the phone number into an API request. No ordinary publisher can expect the many third party platforms to read through their API documentation in the event that someone asks their voice bot service about a telephone number.

The only scalable and flexible way to make data available is to use metadata standards that third party platforms understand. When using metadata standards, a special API isn’t necessary.
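
As a rough illustration, a member services telephone number could be described with the schema.org vocabulary and embedded in a page as JSON-LD. The organization name, URL, and phone number below are placeholders invented for the example, and `to_jsonld_script` is a hypothetical helper. `Organization`, `ContactPoint`, `contactType`, and `telephone` are schema.org terms.

```python
import json

# Placeholder values: the organization, URL, and number are made up.
member_services = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Membership Society",
    "url": "https://www.example.org",
    "contactPoint": {
        "@type": "ContactPoint",
        "contactType": "member services",
        "telephone": "+1-555-0100",
    },
}


def to_jsonld_script(data: dict) -> str:
    """Wrap a dictionary in the script tag used for embedded JSON-LD."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(to_jsonld_script(member_services))
```

Any platform that understands the vocabulary can read the number from the page itself, without consulting publisher-specific API documentation.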

An independent data store (unlike a tethered database) offers two distinct advantages:

1. The data is multi-use, for both published content and to support other platforms (Google, voice bots, etc.)

2. The data is multi-source, coming from authors who create/add new data, from other IT systems, and even from outside sources

The ability of the data store to accept new data is also important. Publishers should grow their data so that they can offer factual information accurately and immediately, wherever it is needed. When authors mention new facts relating to entities, this information can be added to the database. In some cases authors will note what’s new and important to include, much like webmasters can note metadata relating to content using Google’s Data Highlighter tool. In other cases, tools using natural language processing can spot entities, and automatically add metadata. Metadata provides the mechanism by which data gets connected to content.

Metadata makes it easier to revise information that’s subject to change, especially information such as prices, dates, and availability. The latest data is stored in the database, and gets updated there. Content that mentions such information can indicate the variable abstractly, instead of using a changeable value. For example: “You must apply by {application date}.” As a general rule, CMSs don’t make using data variables an easy thing to do.
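
A minimal sketch of the idea in Python: content refers to a variable by name, and the current value is looked up in a separate data store when the content is rendered. The `DATA_STORE` dictionary, the variable name `application_date` (written with an underscore so it works as a placeholder), the sample dates, and the `render` function are all hypothetical. A real data store would sit outside the publishing application.

```python
from string import Template

# Hypothetical stand-in for an independent data store.
# The date is a made-up sample value.
DATA_STORE = {"application_date": "March 1, 2020"}

# Content written by different authors mentions the variable, not the value.
CONTENT_ITEMS = [
    "You must apply by ${application_date}.",
    "Reminder: applications close on ${application_date}.",
]


def render(text: str, data: dict) -> str:
    """Fill in data variables; unknown variables are left untouched."""
    return Template(text).safe_substitute(data)


for item in CONTENT_ITEMS:
    print(render(item, DATA_STORE))

# Next year, one change in the data store updates every mention.
DATA_STORE["application_date"] = "March 1, 2021"
for item in CONTENT_ITEMS:
    print(render(item, DATA_STORE))
```

The sketch uses `safe_substitute` so that a variable missing from the data store stays visible in the output instead of raising an error, which makes gaps easier to spot during review.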

A separate data store makes it simpler to pull data coming from other sources. The data store describes information using metadata standards, making it easy to upload information from different sources. With many CMSs, it is cumbersome to pull information from outside parties. The CMS is like a bubble. Everything may work fine as long as you never want to leave the bubble. That’s true for simple CMSs such as WordPress, and even for complex component CMSs (CCMSs) that support DITA. These hosts are self-contained. They don’t readily accept information from outside sources. The information needs to be entered into their special format, using their specific conventions. The information is not independent of the CMS. The CMS ends up defining the information, rather than simply using it.

A growing number of companies are developing enterprise knowledge graphs — their own sort of Wikidata. These are databases of the key facts that a company needs to refer to. Companies can use knowledge graphs to enhance the content they publish. This innovation is possible because these companies don’t rely on their CMS to manage their data.

— Michael Andrews


Metadata for Appreciation and Transparency

Post author By Michael Andrews
Post date October 9, 2018

Who supports your work? If you work in a non-profit or a university, that’s an important question. These organizations depend on the generosity of others. They should want the world to know who is making what they do possible. Fortunately, new standards for metadata will make that happen.

Individuals and teams who work in the non-profit and academic sectors, who either do research or deliver projects, can use online metadata to raise their profiles. Metadata can help online audiences discover information about grants relating to advancing knowledge or helping others. The metadata can reveal who is making grants, who is getting them, and what the grants cover.

Grants Metadata

A new set of metadata terms is pending in the schema.org vocabulary relating to grants and funding. The terms can help individuals and organizations understand the funding associated with research and other kinds of goal-focused projects conducted by academics and non-profits. The funded item (property: fundedItem) could be anything. While it will often be research (a study or a book), it could also be the delivery of a service such as training, curriculum development, environmental or historical restoration, inoculations, or conferences and festivals. There is no restriction on what kind of project or activity can be indicated.

The schema.org vocabulary is the most commonly used metadata standard for online information, and is used in Google search results, among other online platforms. So the release of new metadata terms in schema.org can have big implications for how people discover and assess information online.

A quick peek at the code will show how it works. Even if you aren’t familiar with what metadata code looks like, it is easy to understand. This example, from the schema.org website, shows that Caroline B Turner receives funding from the National Science Foundation (grant number 1448821). Congratulations, Dr. Turner! How cool is that?

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Person",
  "name": "Turner, Caroline B.",
  "givenName": "Caroline B.",
  "familyName": "Turner",
  "funding": {
    "@type": "Grant",
    "identifier": "1448821",
    "funder": {
      "@type": "Organization",
      "name": "National Science Foundation",
      "identifier": "https://doi.org/10.13039/100000001"
    }
  }
}
</script>

The new metadata anticipates diverse scenarios. Funders can give grants to projects, organizations, or individuals. Grants can be monetary, or in-kind. These elements can be combined with other schema.org vocabulary properties to provide information about how much money went to different people and organizations, and what projects they went to.
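
As a hedged sketch of how these elements might combine, the Python below builds a record for a monetary grant to a project. The funder, project name, and amount are invented, and `monetary_grant` is a hypothetical helper. Because the grant terms were still pending at the time of writing, the property names used here (`MonetaryGrant`, `amount`, `fundedItem`, `funder`) should be checked against the current schema.org vocabulary before use.

```python
import json


def monetary_grant(funder_name: str, project_name: str,
                   value: float, currency: str) -> dict:
    """Build a schema.org-style description of a monetary grant."""
    return {
        "@context": "https://schema.org",
        "@type": "MonetaryGrant",
        "funder": {"@type": "Organization", "name": funder_name},
        "fundedItem": {"@type": "ResearchProject", "name": project_name},
        "amount": {
            "@type": "MonetaryAmount",
            "value": value,
            "currency": currency,
        },
    }


# All values here are invented for illustration.
grant = monetary_grant("Example Foundation", "River Restoration Study",
                       25000, "USD")
print(json.dumps(grant, indent=2))
```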

Showing Appreciation

The first reason to let others know who supports you is to show appreciation. Organizations should want to use the metadata to give recognition to the funder, and encourage their continued future support.

The grants metadata helps people discover what kinds of organizations fund your work. Having funding can bring prestige to an organization. Many organizations are proud to let others know that their work was sponsored by a highly competitive grant. That can bring credibility to their work. As long as the funding organization enjoys a good reputation for being impartial and supporting high quality research, noting the funding organization is a big benefit to both the funder and the grant receiver. Who would want to hide the fact that they received a grant from the MacArthur Foundation, after all?

Appreciation can be expressed for in-kind grants as well. An organization can indicate that a local restaurant is a conference sponsor supplying the coffee and food.

Providing Transparency

The second reason to let others know who supports your work is to provide transparency. For some non-profits, the funding sources are opaque. In this age of widespread distrust, some readers may speculate about the motivations of an organization if information about their finances is missing. The existence of dark money and anonymous donors fuels such distrust. A lack of transparency can spark speculations that might not be accurate. Such speculation can be reduced by disclosing the funder of any grants received.

While the funding source alone doesn’t indicate if the data is accurate, it can help others understand the provenance of the data. Corporations may have a self-interest in the results of research, and some foundations may have an explicit mission that could influence the kinds of research outcomes they are willing to sponsor. As foundations move away from unrestricted grants and toward impact investing, providing details about who sponsors your work can help others understand why you are doing specific kinds of projects.

Transparency about funding reduces uncertainty about conflicts of interest. There’s certainly nothing wrong with an organization funding research they hope will result in a certain conclusion. Pharmaceutical companies understandably hope that the new drugs they are developing will show promise in trials. They rely on third-parties to provide an independent review of a topic. Showing the funding relationship is central to convincing readers that the review is truly independent. If a funding relationship is not disclosed but is hidden, readers will doubt the independence of the researcher, and question the credibility of the results.

It’s common practice for researchers to acknowledge any potential conflict of interest, such as having received money from a source that has a vested interest in what is being reported. The principle of transparency applies not only to doctors reporting on medical research, but also to less formal research. Investment research often indicates if the writer has any ownership of stocks he or she is talking about. And news outlets increasingly note when reporting on a company if that company directly or indirectly owns the outlet. When writing about Amazon, The Washington Post will note “Bezos also owns The Washington Post.”

If the writer presents even the appearance that their judgment was influenced by a financial relationship, they should disclose that relationship to readers. Transparency is an expectation of readers, even though publishers are uneven in their application of transparency.

Right now, transparency is hard for readers to crack. Better metadata could help.

Current Problems with Funding Transparency

Transparency matters for any issue that’s subject to debate or verification, or open to interpretation. One such issue I’m familiar with is antitrust: whether certain firms have too much (monopoly) market power. It’s an issue that has been gaining interest across the globe among people holding different political persuasions, but it’s an issue where there is a range of views and cited evidence. Even if you are not interested in this specific issue, the example of content relating to antitrust illustrates why greater transparency through metadata can be helpful.

A couple of blocks from my home in the Washington DC area is an institution that’s deeply involved in the antitrust policy debate: the Antonin Scalia Law School at George Mason University (GMU), a state-funded university that I financially support as a taxpayer. GMU is perhaps best-known for the pro-market, anti-regulation views of its law and economics faculty. It is the academic home of New York Times columnist Tyler Cowen, and has produced a lot of research and position papers on issues such as copyright, data privacy, and antitrust issues. Last month GMU hosted public hearings for the US Federal Trade Commission (FTC) on the future of antitrust policy.

Earlier this year, GMU faced a transparency controversy. As a state-funded university, it was subject to a Freedom of Information Act (FOIA) request about funding grants it receives. The request revealed that the Charles Koch Foundation had provided an “estimated $50 million” in grants to George Mason University to support their law and economics programs, according to the New York Times. Normally, generosity of that scale would be acknowledged by naming a building after the donor. But in this case the scale of donations only came to light after the FOIA request. Some of this funding entailed conditions that could be seen as compromising the independence of the researchers using the funds.

The New York Times noted that the FOIA also revealed another huge gift to GMU: “executives of the Federalist Society, a conservative national organization of lawyers, served as agents for a $20 million gift from an anonymous donor.” What’s at issue is not whether political advocacy groups are entitled to provide grants, or whether or not the funded research is valid. What’s problematic is that research funding was not transparent.

Right now, it is difficult for citizens to “follow the money” when it comes to corporate-sponsored research on public policy issues such as the future of antitrust. Corporations are willing to provide funding for research that is sympathetic to their positions, but may not want to draw attention to their funding.

In the US, the EU, and elsewhere, elected officials and government regulators have discussed the possibility of bringing new antitrust investigations against Google. For many years, Google has funded research countering arguments that it should be subject to antitrust regulation. But Google has faced its own controversies about its funding transparency, according to a report from the Google Transparency Project, part of the Campaign for Accountability, which describes itself as “a 501(c)(3) non-profit, nonpartisan watchdog organization.” The report “Google Academics” asserts: “Eric Schmidt, then Google’s chief executive, cited a Google-funded author in written answers to Congress to back his contention that his company wasn’t a monopoly. He didn’t mention Google had paid for the paper.”

Google champions the use of metadata, especially the schema.org vocabulary. As Wikipedia notes, “Google’s mission statement is ‘to organize the world’s information and make it universally accessible and useful.’” I like Google for doing that, and hold them to a high standard for transparency precisely because their mission is making information accessible.

Google provides hundreds of research grants to academics and others. How easy is it to know who Google funds? The Google Transparency Project tried to find out who Google funds by using Google Scholar, Google’s online search engine for academic papers. There was no direct way for them to search by funding source.

Searching for grants information without the benefit of metadata is very difficult. Source: Google Transparency Project, “Google Academics” report

They needed to search for phrases such as “grateful to Google.” That’s far short of making information accessible and useful. The funded researchers could express their appreciation more effectively by using metadata to indicate grants funding.

The Google Transparency Project produced another report on the antitrust policy hearings that the FTC sponsored at GMU last month. The report, entitled “FTC Tech Hearings Heavily Feature Google-funded Speakers”, concludes: “A third of speakers have financial ties to Google, either directly or through their employer. The FTC has not disclosed those ties to attendees.” Many of the speakers Google funded were current or former faculty of GMU, according to the report.

I leave it to the reader to decide if the characterizations of the Google Transparency Project are fair and accurate. Assessing their report requires looking at footnotes and checking original sources. How much easier it would be if all the relevant information were captured in metadata, instead of scattered around in text documents.

Right now it is difficult to use Google Scholar to find out what academic research was funded by any specific company or foundation. I can only hope that funders of research, Google included, will encourage those who receive their grants to reveal that sponsorship within the metadata relating to the research. And that recipients will add funding metadata to their online profiles.

The Future of Grants & Funding Metadata

How might the general public benefit from metadata on grants funding? Individuals may want to know what projects or people a funder supports. They want to see how funding sources have changed over time for an organization.

These questions could be answered by a service such as Google, Bing, or Wolfram Alpha. More skilled users could even design their own query of the metadata by using a SPARQL query (SPARQL is a query language for semantic metadata). No doubt many journalists, grants-receiving organizations, and academics will find this information valuable.
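
A small Python sketch suggests the kind of question such a query answers: given a collection of records that use the `funding` and `funder` properties, list who a funder supports. It uses plain dictionary filtering as a stand-in for SPARQL. Apart from the Caroline B. Turner record taken from the schema.org example earlier, the records and the `funded_by` function are hypothetical.

```python
# Records shaped like the schema.org example; the second is invented.
RECORDS = [
    {
        "@type": "Person",
        "name": "Turner, Caroline B.",
        "funding": {
            "@type": "Grant",
            "identifier": "1448821",
            "funder": {"@type": "Organization",
                       "name": "National Science Foundation"},
        },
    },
    {
        "@type": "Person",
        "name": "Example, Pat",
        "funding": {
            "@type": "Grant",
            "identifier": "EX-001",
            "funder": {"@type": "Organization",
                       "name": "Example Foundation"},
        },
    },
]


def funded_by(records: list, funder_name: str) -> list:
    """Return (recipient name, grant identifier) pairs for one funder."""
    matches = []
    for record in records:
        grant = record.get("funding", {})
        if grant.get("funder", {}).get("name") == funder_name:
            matches.append((record["name"], grant.get("identifier")))
    return matches


print(funded_by(RECORDS, "National Science Foundation"))
```

The point of the sketch is that once funding is recorded as structured metadata, the question becomes a simple lookup rather than a search for phrases of thanks in running text.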

Imagine if researchers at taxpayer-supported institutions such as GMU were required to indicate their funding sources within metadata. Or if independent non-profits made it a condition of receiving funding that they indicate the source within metadata. Imagine if the public expected full transparency about funding sources as the norm, rather than as something optional to disclose.

How You Can Get Involved

If you make or receive grants, you can start using the pending Grants metadata now in anticipation of its formal release. Metadata allows an individual to write information once, and reuse it often. When metadata is used to indicate funding, organizations have less worry about forgetting to mention a relationship in a specific context. The information about the relationship is discoverable online.

Note that the specifics of the grants proposal could change when it is released, though I expect they would most likely be tweaks rather than drastic revisions. Some specific details of the proposal will most interest research scientists who are concerned with research productivity and impact metrics, which are of less interest to researchers working in public policy and other areas. While the grants proposal has been under discussion for several years now, the momentum for final release is building and it will hopefully be finalized before long. Many researchers plan to use the newly-released metadata terms for datasets, and want to include funder information as part of their dataset metadata. (Sharing research data is often a condition of research grants, so it makes sense to add funding sponsorship to the datasets.)

If you have suggestions or concerns about the proposal, you can contribute your feedback to the schema.org community GitHub issue (no. 383) for grants. Schema.org is a W3C community, and is open to contributions from anyone.

— Michael Andrews
