Categories
Content Engineering

Reading, Writing, and Entities

Reading involves work, and writing is difficult.  Countless books are available on how to write.  What could be left to say?  In view of all the writing advice that’s available, it’s surprising that one topic gets scant coverage: entities.  Not many writers talk about their use of entities in their writing.  I believe entities can be a powerful lens for considering writing and the reader experience.

What’s an entity?  It’s not a word used much in colloquial speech, but it’s a handy term for nouns that have a specific identity.  Merriam-Webster lists some synonyms of entity: “being, commodity, individual, object, substance, thing.”  These synonyms may seem vague, but specific examples of entities can be concrete.  Most commonly, people associate entities with organizational units, such as a corporate entity.  But the term can refer to all kinds of things: people, places, materials, concepts, brands, time periods, or space aliens.  Merriam-Webster cites the following usage example: “the question of whether extrasensory perception will ever be a scientifically recognized entity.”  In this example, the term entity refers to a phenomenon that many people don’t consider real: ESP.  The characters in Harry Potter novels can be entities, as can a celebrity or a football team.

Perhaps the easiest way to think about an entity is to ask if something would have an entry in an encyclopedia — if so, it is likely an entity.  Entities are nouns referring to a type of thing (a category, such as mountains), or a specific individual example of a thing (a proper noun, such as the Alps).  Not all nouns are entities: they need to be specific and not generic.  A window would probably be too generic to be an entity — without further information, the reader won’t think much about it.  A double-glazed window, or a window on the Empire State Building, would be an entity, because there’s enough context to differentiate it from generic examples.  Windows as a category could be an entity, since they can be considered in terms of their global properties and variations: frosted windows, stained-glass windows, and so on.  While there is no hard and fast rule about what’s considered an entity, the more salient something is in the text, the more likely it is to be an entity.  A single mention of a generic window would not be an entity, but a longer discussion of windows as an architectural feature would be.

Entities are interesting in writing because they carry semantic meaning (as opposed to other kinds of meaning, such as mood or credibility).  Entities make writing less generic.  They overlap with the concept of detail in writing, but the role they play is different from making writing vivid by providing detail.  Entities are the foreground of the writing, not the background.  Many details, such as the brand of scarf a protagonist wears, are not terribly important.  Details are background color, and in some writing are extraneous.  Entities, in contrast, are the key factual details mentioned in the text.  They can be central to the content’s meaning.

Ease of reading and understanding

Clarity is an obsession of many writers.  Entities can play an important role in clarity.

I became more mindful of the role of entities in writing while reading a recent book of jazz criticism by Nate Chinen.  I enjoy learning about jazz, and the writer is very knowledgeable on the subject.  He personally knows many of the people he writes about, and can draw numerous connections between artists and their works.  Yet the book was difficult to read.  I realized that the book talked about too many entities, too quickly.  A single sentence could mention artists, works, dates, places, music styles, and awards.  While I know a bit about jazz, my mind was often overloaded with details, some of which I didn’t understand completely.  I felt the author was at times “name checking”: dropping the names of people and things he knew, in the expectation that the reader would be impressed.

Chinen created what I’ll call “dense content” — text that’s full of entities.  His writing shows how density can backfire.  But not all dense content is necessarily hard to understand.

If dense content can be difficult to understand, is light content a better option?  Should entities be mentioned sparingly?

Light content is favored by champions of readability.  Writing should be simple and easy to read, and readability advocates have devised formulas to measure how readable a text is.  Texts are scored according to different criteria that are believed to influence readability:

  1. Sentence length
  2. Syllables per sentence
  3. The proportion of commonly used words in the text

All these metrics favor the use of short, simple words, and tend to penalize extensive reference to entities, which are often longer, less familiar words.
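To make the criteria concrete, here is a toy Python sketch that scores a text on the three measures listed above.  The common-word list and the vowel-run syllable heuristic are crude stand-ins invented for illustration; real formulas such as Flesch or Dale-Chall use calibrated constants and large curated word lists.

```python
import re

# A tiny list of very common English words, for illustration only.
COMMON_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to", "in", "on", "is",
    "are", "was", "were", "it", "that", "this", "for", "with", "as", "by",
}

def count_syllables(word: str) -> int:
    """Rough syllable count: runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_metrics(text: str) -> dict:
    """Compute the three readability criteria listed above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "words_per_sentence": len(words) / len(sentences),
        "syllables_per_sentence": sum(count_syllables(w) for w in words) / len(sentences),
        "common_word_ratio": sum(w.lower() in COMMON_WORDS for w in words) / len(words),
    }

metrics = readability_metrics(
    "The cat sat on the mat. Readability formulas penalize polysyllabic vocabulary."
)
# The second sentence drags every score down: longer words, fewer common ones.
```

Notice how the scores reward the banal first sentence over the more informative second one, which is exactly the weakness discussed next.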

So if readability scores are maximized, does understanding improve?  Not necessarily.  Highly readable content, at least as scored according to these metrics, may in fact be vague content that’s full of generalities and lacking concrete examples.  The concept of readability confuses syntactical issues (the formation of sentences) with semantic ones (the meaning of sentences).  Ease of reading is only partly correlated with depth of understanding.

The empty mind versus the knowing mind

One of the limitations of readability as an approach is that it doesn’t consider the reader’s prior knowledge of a topic.  It assumes the reader has an empty mind about the topic, so that nothing about the meaning should be left in doubt.  Readability incorporates a generic idea of education level, but it is silent about what different people already know.  For example, my annoyance at the jazz criticism book may be a sign that I wasn’t the target audience for the book: I over-estimated my knowledge, and have blamed the author for making me feel unknowledgeable.  Indeed, some readers are enthusiastic about the dense detail in the book.  I, however, wanted more background about these details if they were considered important enough to mention.

One way to extend the concept of readability to incorporate understanding is to measure the use of entities in writing.  I would suggest two concepts:

  1. Entity density
  2. Entity novelty

Entity density refers to how many different entities are mentioned in the text.  Some text will be more dense with entities than other text.  Entity density could be measured as entities per sentence, or as the total entities mentioned in an article.  Computers can already recognize entities in text, so an application could easily calculate the number of entities in an article, and the average per sentence.

Example of computer recognition of entities in text.
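A naive version of such a calculation can be sketched in Python.  The capitalized-word heuristic below is a stand-in of my own, not a real named-entity recognizer; production tools use trained statistical models rather than pattern matching.

```python
import re

def candidate_entities(sentence: str) -> list:
    """Naive entity spotter: runs of capitalized words (e.g. 'New York').
    A trained recognizer would do far better; this only illustrates the idea."""
    return re.findall(r"(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*", sentence)

def entity_density(text: str) -> float:
    """Average number of candidate entities per sentence."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    # Skip each sentence's first word, which is capitalized regardless.
    counts = [len(candidate_entities(s[s.find(" ") + 1:])) for s in sentences]
    return sum(counts) / len(sentences)

density = entity_density(
    "Miles Davis recorded Kind of Blue in New York. The album won many awards."
)  # → 2.0: four candidates in the first sentence, none in the second
```

Even this rough count distinguishes an entity-packed sentence from a generic one, which is all a density metric needs to do.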


Entity novelty takes the idea a step further.  It asks: how many new entities does the text introduce to the reader?  For example, I’ve been discussing an entity called “readability.”  I am assuming the reader has an idea what I am referring to.  If not, readability would be a novel entity for the reader.  It is more difficult to calculate the number of unknown entities within a text.  Perhaps reading apps could track whether a reader has frequently encountered an entity before; if so, the app could assume the entity is no longer novel.
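The tracking idea could be sketched as follows.  The class and its interface are hypothetical, meant only to show the bookkeeping a reading app would need; a real app would persist the history and use proper entity recognition rather than plain strings.

```python
class NoveltyTracker:
    """Remembers which entities a reader has already encountered, so that
    the novelty of each new text can be estimated."""

    def __init__(self):
        self.seen = set()

    def novelty(self, entities: list) -> float:
        """Return the fraction of this text's distinct entities that are
        new to the reader, then mark them all as seen."""
        distinct = {e.lower() for e in entities}
        new = distinct - self.seen
        self.seen |= distinct
        return len(new) / len(distinct) if distinct else 0.0

tracker = NoveltyTracker()
first = tracker.novelty(["readability", "entity density"])   # → 1.0, all new
second = tracker.novelty(["readability", "entity novelty"])  # → 0.5, one repeat
```

The same article would score differently for different readers, which is the point: novelty is a property of the reader-text pair, not of the text alone.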

The idea behind these metrics is to highlight how entities can be either helpful or distracting.  The text could have many entities and be helpful to the reader, if the reader was already familiar with the entities.  The text can include unfamiliar entities, provided there aren’t too many.  But if the text has too many entities that are novel for the reader, both readability and understanding may suffer.

Scanning and entities

Another dimension that readability metrics miss is the scan-ability of text.  The assumption of readability is that the entire text will be read.  In practice, many readers choose what parts of the text to read based on interests and relevance.  The mention of entities in text can influence how easily readers can find text of interest.  Readers may be looking for indications that the text contains material that they:

  • Already know
  • Are not interested in
  • Know they are interested in
  • Find unfamiliar but are curious about

Instead of considering text from the perspective of the “empty mind,” scan-ability considers text from the perspective of the “knowing mind.”  Readers often search for concrete words in text, especially capitalized proper nouns.  Vague, generic text is hard to scan.

Imagine a reader who wants to know about Japan’s banking system.  What entity would they look for?  That will depend partly on their existing knowledge.  If they want to know who is in charge of banking in Japan, they will look for mentions of specific entities.  Perhaps they know the name of the person and will search for that name.  Or they may not know the name of the person, but have an idea of their formal title, so they will look for a mention of the words “Japan,” “Bank,” and “Governor.”  If they don’t know the formal title, they might look for mentions of a person’s role, such as “head of the central bank.”  In text, all these entities (name, title, and role) could appear in a paragraph on the topic.  All aid in the scanning of text.

Entities can help readers find information another way as well.  Entities can be described with metadata, which makes the information much easier to find online when searching and browsing.  When computers describe entities, they can keep track of different terms used to describe them, so that readers can find what they need whether or not they know about the topic already.  Metadata can connect different aspects of an entity, so that people can search for a name, a title, or a role and be taken to the same information.
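As a sketch of how metadata could connect these aspects, the following builds a schema.org-style description tying together a name, a formal title, and an informal role, echoing the Bank of Japan example above.  The person’s name is a made-up placeholder, and the exact property choices are my own illustration of the approach.

```python
import json

# Hypothetical record: the name below is a placeholder, not a real person.
governor = {
    "@context": "http://schema.org",
    "@type": "Person",
    "name": "Taro Yamada",                        # the name a reader might search for
    "jobTitle": "Governor of the Bank of Japan",  # the formal title
    "description": "Head of the central bank",    # the informal role
    "worksFor": {
        "@type": "Organization",
        "name": "Bank of Japan",
    },
}

# Serialized as JSON-LD, this could be embedded in a web page so that a
# search for the name, the title, or the role leads to the same information.
jsonld = json.dumps(governor, indent=2)
```

Because the metadata holds all three handles at once, readers can arrive from whichever term matches their existing knowledge.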

— Michael Andrews

Categories
Content Engineering

Metadata for Appreciation and Transparency

Who supports your work? If you work in a non-profit or a university, that’s an important question. These organizations depend on the generosity of others. They should want the world to know who is making what they do possible. Fortunately, new standards for metadata will make that happen.

Individuals and teams who work in the non-profit and academic sectors, who either do research or deliver projects, can use online metadata to raise their profiles. Metadata can help online audiences discover information about grants relating to advancing knowledge or helping others. The metadata can reveal who is making grants, who is getting them, and what the grants cover.

Grants Metadata

A new set of metadata terms is pending in the schema.org vocabulary relating to grants and funding. The terms can help individuals and organizations understand the funding associated with research and other kinds of goal-focused projects conducted by academics and non-profits. The funded item (property: fundedItem) could be anything. It will often be research (a study or a book), but it could also be the delivery of a service such as training, curriculum development, environmental or historical restoration, inoculations, or conferences and festivals. There is no restriction on what kind of project or activity can be indicated.

The schema.org vocabulary is the most commonly used metadata standard for online information, and is used in Google search results, among other online platforms. So the release of new metadata terms in schema.org can have big implications for how people discover and assess information online.

A quick peek at the code will show how it works. Even if you aren’t familiar with what metadata code looks like, it is easy to understand. This example, from the schema.org website, shows that Caroline B Turner receives funding from the National Science Foundation (grant number 1448821). Congratulations, Dr. Turner! How cool is that?

  <script type="application/ld+json">
  {
    "@context": "http://schema.org",
    "@type": "Person",
    "name": "Turner, Caroline B.",
    "givenName": "Caroline B.",
    "familyName": "Turner",
    "funding": {
      "@type": "Grant",
      "identifier": "1448821",
      "funder": {
        "@type": "Organization",
        "name": "National Science Foundation",
        "identifier": "https://doi.org/10.13039/100000001"
      }
    }
  }
  </script>


The new metadata anticipates diverse scenarios. Funders can give grants to projects, organizations, or individuals. Grants can be monetary or in-kind. These elements can be combined with other schema.org properties to provide information about how much money went to different people and organizations, and for what projects.
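As an illustration, the pending Grant terms could be combined with existing schema.org types such as MonetaryGrant and MonetaryAmount along these lines. The funder, grant number, and project below are all invented, and the property names follow the proposal, so they could change before release.

```python
import json

# Hypothetical monetary grant to a project. Every name and value here is
# invented for illustration; property names follow the pending proposal.
grant = {
    "@context": "http://schema.org",
    "@type": "MonetaryGrant",
    "identifier": "2018-1234",            # hypothetical grant number
    "funder": {
        "@type": "Organization",
        "name": "Example Foundation",     # hypothetical funder
    },
    "amount": {
        "@type": "MonetaryAmount",        # how much money was granted
        "currency": "USD",
        "value": 250000,
    },
    "fundedItem": {
        "@type": "Project",
        "name": "Community Literacy Program",  # hypothetical funded project
    },
}

print(json.dumps(grant, indent=2))
```

The same pattern works for in-kind grants by describing the donated goods or services in the fundedItem instead of a monetary amount.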

Showing Appreciation

The first reason to let others know who supports you is to show appreciation. Organizations should want to use the metadata to give recognition to the funder, and to encourage their continued support.

The grants metadata helps people discover what kinds of organizations fund your work. Having funding can bring prestige to an organization. Many organizations are proud to let others know that their work was sponsored by a highly competitive grant. That can bring credibility to their work. As long as the funding organization enjoys a good reputation for being impartial and supporting high quality research, noting the funding organization is a big benefit to both the funder and the grant receiver. Who would want to hide the fact that they received a grant from the MacArthur Foundation, after all?

Appreciation can be expressed for in-kind grants as well. An organization can indicate that a local restaurant is a conference sponsor supplying the coffee and food.

Providing Transparency

The second reason to let others know who supports your work is to provide transparency. For some non-profits, the funding sources are opaque. In this age of widespread distrust, some readers may speculate about the motivations of an organization if information about its finances is missing. The existence of dark money and anonymous donors fuels such distrust. A lack of transparency can spark speculation that might not be accurate. Such speculation can be reduced by disclosing the funder of any grants received.

While the funding source alone doesn’t indicate whether the data is accurate, it can help others understand the provenance of the data. Corporations may have a self-interest in the results of research, and some foundations may have an explicit mission that could influence the kinds of research outcomes they are willing to sponsor. As foundations move away from unrestricted grants and toward impact investing, providing details about who sponsors your work can help others understand why you are doing specific kinds of projects.

Transparency about funding reduces uncertainty about conflicts of interest. There’s certainly nothing wrong with an organization funding research it hopes will reach a certain conclusion. Pharmaceutical companies understandably hope that the new drugs they are developing will show promise in trials. They rely on third parties to provide an independent review of a topic. Showing the funding relationship is central to convincing readers that the review is truly independent. If a funding relationship is hidden rather than disclosed, readers will doubt the independence of the researcher, and question the credibility of the results.

It’s common practice for researchers to acknowledge any potential conflict of interest, such as having received money from a source that has a vested interest in what is being reported. The principle of transparency applies not only to doctors reporting on medical research, but also to less formal research. Investment research often indicates whether the writer owns any of the stocks he or she is discussing. And news outlets increasingly note, when reporting on a company, if that company directly or indirectly owns the outlet. When writing about Amazon, The Washington Post will note “Bezos also owns The Washington Post.”

If the writer presents even the appearance that their judgment was influenced by a financial relationship, they should disclose that relationship to readers. Transparency is an expectation of readers, even though publishers are uneven in their application of transparency.

Right now, funding transparency is hard for readers to verify. Better metadata could help.

Current Problems with Funding Transparency

Transparency matters for any issue that’s subject to debate or verification, or open to interpretation. One such issue I’m familiar with is antitrust — whether certain firms have too much (monopoly) market power. It’s an issue that has been gaining interest across the globe among people of different political persuasions, but it’s one where there is a range of views and cited evidence. Even if you are not interested in this specific issue, the example of content relating to antitrust illustrates why greater transparency through metadata can be helpful.

A couple of blocks from my home in the Washington DC area is an institution that’s deeply involved in the antitrust policy debate: the Antonin Scalia Law School at George Mason University (GMU), a state-funded university that I financially support as a taxpayer. GMU is perhaps best known for the pro-market, anti-regulation views of its law and economics faculty. It is the academic home of New York Times columnist Tyler Cowen, and has produced a lot of research and position papers on issues such as copyright, data privacy, and antitrust. Last month GMU hosted public hearings for the US Federal Trade Commission (FTC) on the future of antitrust policy.

Earlier this year, GMU faced a transparency controversy. As a state-funded university, it was subject to a Freedom of Information Act (FOIA) request about funding grants it receives. The request revealed that the Charles Koch Foundation had provided an “estimated $50 million” in grants to George Mason University to support its law and economics programs, according to the New York Times. Normally, generosity of that scale would be acknowledged by naming a building after the donor. But in this case the scale of donations only came to light after the FOIA request. Some of this funding entailed conditions that could be seen as compromising the independence of the researchers using the funds.

The New York Times noted that the FOIA request also revealed another huge gift to GMU: “executives of the Federalist Society, a conservative national organization of lawyers, served as agents for a $20 million gift from an anonymous donor.” What’s at issue is not whether political advocacy groups are entitled to provide grants, or whether or not the funded research is valid. What’s problematic is that research funding was not transparent.

Right now, it is difficult for citizens to “follow the money” when it comes to corporate-sponsored research on public policy issues such as the future of antitrust. Corporations are willing to provide funding for research that is sympathetic to their positions, but may not want to draw attention to their funding.

In the US, the EU, and elsewhere, elected officials and government regulators have discussed the possibility of bringing new antitrust investigations against Google. For many years, Google has funded research countering arguments that it should be subject to antitrust regulation. But Google has faced its own controversies about its funding transparency, according to a report from the Google Transparency Project, part of the Campaign for Accountability, which describes itself as “a 501(c)(3) non-profit, nonpartisan watchdog organization.” The report “Google Academics” asserts: “Eric Schmidt, then Google’s chief executive, cited a Google-funded author in written answers to Congress to back his contention that his company wasn’t a monopoly. He didn’t mention Google had paid for the paper.”

Google champions the use of metadata, especially the schema.org vocabulary. As Wikipedia notes, “Google’s mission statement is ‘to organize the world’s information and make it universally accessible and useful.’” I like Google for doing that, and hold them to a high standard for transparency precisely because their mission is making information accessible.

Google provides hundreds of research grants to academics and others. How easy is it to know whom Google funds? The Google Transparency Project tried to find out by using Google Scholar, Google’s online search engine for academic papers. There was no direct way for them to search by funding source.

Searching for grants information without the benefit of metadata is very difficult. Source: Google Transparency Project, “Google Academics” report

They needed to search for phrases such as “grateful to Google.” That’s far short of making information accessible and useful. The funded researchers could express their appreciation more effectively by using metadata to indicate grants funding.

Google Transparency Project produced another report on the antitrust policy hearings that the FTC sponsored at GMU last month. The report, entitled “FTC Tech Hearings Heavily Feature Google-funded Speakers,” concludes: “A third of speakers have financial ties to Google, either directly or through their employer. The FTC has not disclosed those ties to attendees.” Many of the speakers Google funded were current or former faculty of GMU, according to the report.

I leave it to the reader to decide if the characterizations of the Google Transparency Project are fair and accurate. Assessing their report requires looking at footnotes and checking original sources. How much easier it would be if all the relevant information were captured in metadata, instead of scattered around in text documents.

Right now it is difficult to use Google Scholar to find out what academic research was funded by any specific company or foundation. I can only hope that funders of research, Google included, will encourage those who receive their grants to reveal that sponsorship within the metadata relating to the research. And that recipients will add funding metadata to their online profiles.

The Future of Grants & Funding Metadata

How might the general public benefit from metadata on grants funding? Individuals may want to know what projects or people a funder supports. They may want to see how funding sources have changed over time for an organization.

These questions could be answered by a service such as Google, Bing, or Wolfram Alpha. More skilled users could even design their own query of the metadata using SPARQL (a query language for semantic metadata). No doubt many journalists, grants-receiving organizations, and academics will find this information valuable.

Imagine if researchers at taxpayer-supported institutions such as GMU were required to indicate their funding sources within metadata. Or if independent non-profits made it a condition of receiving funding that they indicate the source within metadata. Imagine if the public expected full transparency about funding sources as the norm, rather than as something optional to disclose.

How You Can Get Involved

If you make or receive grants, you can start using the pending Grants metadata now in anticipation of its formal release. Metadata allows an individual to write information once, and reuse it often. When metadata is used to indicate funding, organizations worry less about forgetting to mention a relationship in a specific context. The information about the relationship is discoverable online.

Note that the specifics of the grants proposal could change when it is released, though I expect any changes will be tweaks rather than drastic revisions. Some details of the proposal will most interest research scientists who are concerned with research productivity and impact metrics, topics of less interest to researchers working in public policy and other areas. While the grants proposal has been under discussion for several years now, the momentum for final release is building, and it will hopefully be finalized before long. Many researchers plan to use the newly-released metadata terms for datasets, and want to include funder information as part of their dataset metadata. (Sharing research data is often a condition of research grants, so it makes sense to add funding sponsorship to the datasets.)

If you have suggestions or concerns about the proposal, you can contribute your feedback to the schema.org community GitHub issue (no. 383) for grants. Schema.org is a W3C community, and is open to contributions from anyone.

— Michael Andrews