Reading, Writing, and Entities

Reading involves work and writing is difficult.  Countless books are available on how to write.  What could be left to say?  In view of all the writing advice that’s available, it’s surprising that one topic gets scant coverage: entities.  Not many writers talk about their use of entities in their writing.  I believe entities can be a powerful lens for considering writing and the reader experience.

What’s an entity?  It is not a word used much in colloquial speech.  But it’s a handy term for nouns that have a specific identity.  Merriam-Webster lists some synonyms of entity: “being, commodity, individual, object, substance, thing.”  The words used to suggest the idea of an entity may seem vague, but specific examples of entities can be concrete.  Most commonly, people associate entities with organizational units, such as a corporate entity.  But the term can refer to all kinds of things: people, places, materials, concepts, brands, time periods, or space aliens.  Merriam-Webster cites the following usage example: “the question of whether extrasensory perception will ever be a scientifically recognized entity.”  In this example, the term entity refers to a phenomenon that many people don’t consider real: ESP.  The characters in Harry Potter novels can be entities, as can a celebrity or a football team.

Perhaps the easiest way to think about an entity is to ask whether something would have an entry in an encyclopedia — if so, it is likely an entity.  Entities are nouns referring to a type of thing (a category, such as mountains) or a specific individual example of a thing (a proper noun, such as the Alps).  Not all nouns are entities: they need to be specific, not generic.  A window would probably be too generic to be an entity — without further information, the reader won’t think much about it.  A double-glazed window, or a window on the Empire State Building, would be an entity, because there’s enough context to differentiate it from generic examples.  Windows as a category could be an entity, since they can be considered in terms of their global properties and variations: frosted windows, stained-glass windows, and so on.  While there is no hard and fast rule about what counts as an entity, the more salient something is in the text, the more likely it is to be an entity.  A single mention of a generic window would not be an entity, but a longer discussion of windows as an architectural feature would be.

Entities are interesting in writing because they carry semantic meaning (as opposed to other kinds of meaning, such as mood or credibility).  Entities make writing less generic.  They overlap with the concept of detail in writing, but the role that entities play is different from making writing vivid by providing detail.  Entities are the foreground of the writing, not the background.  Many details in writing, such as the brand of scarf a protagonist wears, are not terribly important.  Details are background color and, in some writing, are extraneous.  Entities, in contrast, are the key factual details mentioned in the text.  They can be central to the content’s meaning.

Ease of reading and understanding

Clarity is an obsession of many writers.  Entities can play an important role in clarity.

I became more mindful of the role of entities in writing while reading a recent book of jazz criticism by Nate Chinen.  I enjoy learning about jazz, and the writer is very knowledgeable on the subject.  He personally knows many of the people he writes about, and can draw numerous connections between artists and their works.  Yet the book was difficult to read.  I realized that the book talked about too many entities, too quickly.  A single sentence could mention artists, works, dates, places, musical styles, and awards.  While I know a bit about jazz, my mind was often overloaded with details, some of which I didn’t understand completely.  I felt the author was at times “name checking,” dropping names of people and things he knew and expecting the reader to be impressed that he knew them.

Chinen created what I’ll call “dense content” — text that’s full of entities.  His writing provides a negative example of dense writing.  But not all dense content is necessarily hard to understand.

If dense content can be difficult to understand, is light content a better option?  Should entities be mentioned sparingly?

Light content is favored by champions of readability.  Writing should be simple and easy to read, and readability advocates have devised formulas to measure how readable a text is.  Texts are scored according to different criteria that are believed to influence readability:

  1. Sentence length
  2. Syllables per word
  3. Ratio of commonly used words to the total words in the text

All these metrics favor short sentences and short, simple words, and tend to penalize extensive reference to entities, which are often longer, less familiar words.
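To make this concrete, here is a minimal sketch (in Python) of one widely used formula, Flesch reading ease, which rewards short sentences and short words.  The function and the sample numbers are purely illustrative, and syllable counting is treated as a precomputed input, since doing it well is a separate problem.

    # A sketch of the Flesch reading-ease formula: higher scores mean
    # "easier" text. Syllables are passed in as a precomputed count.
    def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
        words_per_sentence = total_words / total_sentences
        syllables_per_word = total_syllables / total_words
        return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

    # Short sentences made of common one- and two-syllable words score high;
    # long sentences packed with polysyllabic entity names pull the score down.
    print(flesch_reading_ease(total_words=20, total_sentences=2, total_syllables=26))

Notice that nothing in the formula asks whether the words actually mean anything to the reader, which is the gap explored below.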

So if readability scores are maximized, does understanding improve?  Not necessarily.  Highly readable content, at least as scored according to these metrics, may in fact be vague content that’s full of generalities and lacking concrete examples.  The concept of readability confuses syntactical issues (the formation of sentences) with semantic ones (the meaning of sentences).  Ease of reading is only partly correlated with depth of understanding.

The empty mind versus the knowing mind

One of the limitations of readability as an approach is that it doesn’t consider the reader’s prior knowledge of a topic.  It assumes the reader has an empty mind about the topic, and so nothing should be left in doubt as to meaning.  Readability incorporates a generic idea of education level, but it is silent about what different people already know.  For example, my annoyance at the jazz criticism book may be a sign that I wasn’t the target audience for the book: I overestimated my knowledge, and have blamed the author for making me feel unknowledgeable.  Indeed, some readers are enthusiastic about the dense detail in the book.  I, however, wanted more background about these details if they were considered important enough to mention.

One way to extend the concept of readability to incorporate understanding is to measure the use of entities in writing.  I would suggest two concepts:

  1. Entity density
  2. Entity novelty

Entity density refers to how many different entities are mentioned in the text.  Some texts are denser with entities than others.  Entity density could be measured as entities per sentence, or as the total entities mentioned in an article.  Computers can already recognize entities in text, so an application could easily calculate the number of entities in an article and the average per sentence.
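As a rough sketch of how such a calculation might work, the snippet below uses the spaCy library and its small English model (an assumption on my part, not a tool discussed in this piece) to count entities per sentence.  Named-entity recognition captures a narrower set than the notion of entity discussed above, but it gives a serviceable density measure.

    # A sketch of entity-density scoring, assuming the spaCy library and its
    # small English model (en_core_web_sm) are installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def entity_density(text: str) -> dict:
        doc = nlp(text)
        sentences = list(doc.sents)
        entities = list(doc.ents)  # the named entities spaCy recognizes
        return {
            "total_entities": len(entities),
            "entities_per_sentence": len(entities) / max(len(sentences), 1),
            "entities": [(ent.text, ent.label_) for ent in entities],
        }

    print(entity_density("Nate Chinen connects John Coltrane to younger players in Brooklyn."))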

[Image: Example of computer recognition of entities in text.]

Entity novelty takes the idea a step further.  It asks: how many new entities does the text introduce to the reader?  For example, I’ve been discussing an entity called “readability.”  I am assuming the reader has an idea what I am referring to.  If not, readability would be a novel entity for the reader.  It is more difficult to calculate the number of unknown entities within a text.  Perhaps reading apps could track whether an entity has been encountered frequently before; if it has, the app could assume the entity is no longer novel.
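A hypothetical sketch of how that tracking might work: the app keeps a running set of entities the reader has already seen, and counts how many entities in a new text fall outside that set.  The function and the data below are made up for illustration.

    # Hypothetical novelty tracking: compare a text's entities against the
    # set of entities the reader has encountered before.
    def novel_entities(entity_names: list[str], seen: set[str]) -> tuple[list[str], set[str]]:
        new = [name for name in entity_names if name.lower() not in seen]
        updated_seen = seen | {name.lower() for name in entity_names}
        return new, updated_seen

    seen_so_far: set[str] = set()
    article_1 = ["Nate Chinen", "John Coltrane", "Brooklyn"]
    article_2 = ["Nate Chinen", "Blue Note Records"]

    new_1, seen_so_far = novel_entities(article_1, seen_so_far)  # all three are novel
    new_2, seen_so_far = novel_entities(article_2, seen_so_far)  # only "Blue Note Records" is novel
    print(new_1, new_2)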

The idea behind these metrics is to highlight how entities can be either helpful or distracting.  A text can have many entities and still be helpful, if the reader is already familiar with them.  A text can also include unfamiliar entities, provided there aren’t too many.  But if the text has too many entities that are novel for the reader, both readability and understanding may suffer.

Scanning and entities

Another dimension that readability metrics miss is the scan-ability of text.  The assumption of readability is that the entire text will be read.  In practice, many readers choose what parts of the text to read based on interests and relevance.  The mention of entities in text can influence how easily readers can find text of interest.  Readers may be looking for indications that the text contains material that they:

  • Already know
  • Are not interested in
  • Know they are interested in
  • Find unfamiliar but are curious about.

Instead of considering text from the perspective of the “empty mind,” scan-ability considers text from the perspective of the “knowing mind.”  Readers often search for concrete words in text, especially capitalized proper nouns.  Vague, generic text is hard to scan.

Imagine a reader who wants to know about Japan’s banking system.  What entity would they look for?  That will depend partly on their existing knowledge.  If they want to know who is in charge of banking in Japan, they will look for mentions of specific entities.  Perhaps they know the name of the person and will search for that name.  Or they may not know the name of the person, but have an idea of their formal title so they will look for a mention of the words “Japan,” “Bank,” and “Governor.”    If they don’t know the formal title, they might look for mentions of a person’s role, such as “head of the central bank.”  In text, all these entities (name, title, and role) could appear in a paragraph on the topic.   All aid in the scanning of text.

Entities can help readers find information another way as well.  Entities can be described with metadata, which makes the information much easier to find online when searching and browsing.  When computers describe entities, they can keep track of different terms used to describe them, so that readers can find what they need whether or not they know about the topic already.  Metadata can connect different aspects of an entity, so that people can search for a name, a title, or a role and be taken to the same information.
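As an illustration of what such metadata might look like, here is a made-up description of a person using the schema.org vocabulary (the name is a placeholder).  The name, the formal title, and the role all attach to the same entity, so a search on any of them can lead to the same information.

    {
      "@context": "https://schema.org",
      "@type": "Person",
      "name": "Taro Yamada",
      "jobTitle": "Governor of the Bank of Japan",
      "description": "Head of Japan's central bank",
      "worksFor": {
        "@type": "Organization",
        "name": "Bank of Japan"
      }
    }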

— Michael Andrews

Auditing Metadata Serialized in JSON-LD

As websites publish more metadata, publishers need ways to audit what they’ve published. This post will look at a tool called jq that can be used to audit metadata.

Metadata code is invisible to audiences. It operates behind the scenes. Finding out what metadata exists entails looking at the source code, squinting at a jumble of div tags, CSS, JavaScript, and other stuff. Glancing at the source code is not a very efficient way to see what metadata is included with the content. Publishers need easy ways for their web teams to find out what metadata they’ve published.

This discussion will focus on metadata that’s serialized in the JSON-LD format. One nice thing about JSON-LD is that it separates the metadata from other code, making it easier to locate. For those not familiar with JSON-LD, a brief introduction. JSON-LD is the latest format for encoding web metadata, especially the widely used schema.org vocabulary. JSON-LD is still less pervasive than microdata and RDFa, which are embedded within HTML elements. But JSON-LD has quickly emerged as the preferred syntax for many websites. It is more developer-friendly than the HTML-based syntaxes, and shares a common heritage with the widely used JSON data format.
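For readers who have never seen it, here is a minimal, made-up example of what a JSON-LD block looks like inside a page (the organization and URLs are placeholders):

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Organization",
      "name": "Example Corp",
      "url": "https://www.example.com",
      "logo": "https://www.example.com/logo.png"
    }
    </script>

Everything the page wants to say to machines sits inside that one script element, separate from the HTML that renders the visible content.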

According to statistics, around 225,000 websites are using JSON-LD. That’s about 21% of all websites globally, and is nearly 30% of English language websites. Some major sites using JSON-LD for metadata include Apple, Booking.com, Ebay, LinkedIn, and Yelp.

Why Audit Metadata?

I’ve previously touched on the value of auditing metadata in my book, Metadata Basics for Web Content. For this discussion, I want to highlight a few specific benefits.

For those who work with SEO, the value of knowing what metadata exists is obvious: it influences discovery through search. But content creators will also want to know the metadata profile of their content. It can yield insights useful for editorial planning.

Metadata provides a useful summary of the key information within published content. Reviewing metadata can provide a quick synopsis of what the content is about. At the same time, if metadata is missing, that means that machines can’t find the key information that audiences will want to know when viewing the content.

Auditing can reveal:

  • what key information is included in the content
  • if any important properties are missing that should be included

Online publishers should routinely audit their own metadata. And they may decide they’d benefit by auditing their competitors’ metadata as well. Generally, the more detailed and complete the metadata is, the more likely a publisher will be successful with their content. So seeing how well one’s own metadata compares with one’s competitors’ can reveal important insights into how readily audiences can access information.

How to Audit JSON-LD metadata

Metadata is code, written for machines. So how can members of web teams, whether writers or SEO specialists, get a quick sense of what metadata they have currently? Since I have a mission to evangelize the benefits of metadata to all content stakeholders, including less technical ones, I’ve been looking for lightweight ways to help all kinds of people discover what metadata they have.

For metadata encoded in HTML tags, the simplest way to explore it is with XPath, a simple query syntax that searches down the DOM tree to find the relevant part containing the metadata. XPath is not too hard to learn (at least for basic needs), and is available within common tools such as Google Sheets.
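For example, the IMPORTXML function in Google Sheets accepts an XPath query; the following formula (with a placeholder URL) pulls the value of a page’s description meta tag:

    =IMPORTXML("https://www.example.com", "//meta[@name='description']/@content")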

Unfortunately, XPath can’t be used for metadata in JSON-LD. But happily, there is an equivalent to XPath that can be used to query JSON-based metadata. It is called jq.

The first step in doing an audit is to extract the JSON-LD from the website you want to audit. It lives within the element <script type="application/ld+json"></script>. Even if you need to manually extract the JSON-LD, it is easy to find in the source code (use Ctrl-F and search for ld+json). Be aware that there may be more than one JSON-LD metadata statement on a page. For example, when looking at the source code of a webpage on Apple’s website, I notice three JSON-LD script elements representing three different statements: one covering product information (Offer), one covering the company (Organization), and another covering the website structure (BreadcrumbList). Some automated tools have been known to stop harvesting JSON-LD statements after finding the first one, so make sure you get them all, especially the ones with information unique to the webpage.
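If you would rather not copy the statements out by hand, a short script can gather them. The sketch below assumes the Python requests and beautifulsoup4 libraries are installed and uses a placeholder URL; it collects every ld+json script on the page, not just the first.

    # Collect every JSON-LD statement from a page.
    import json
    import requests
    from bs4 import BeautifulSoup

    def extract_jsonld(url: str) -> list:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        statements = []
        for script in soup.find_all("script", type="application/ld+json"):
            try:
                statements.append(json.loads(script.string))
            except (TypeError, ValueError):
                pass  # skip empty or malformed blocks
        return statements

    statements = extract_jsonld("https://www.example.com")
    print(len(statements), "JSON-LD statements found")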

Once you have collected the JSON-LD statements, you can begin to audit them to see what information they contain. Much like a content audit, you can set up a spreadsheet to track metadata for specific URLs.

Exploring JSON-LD with jq

jq is a “command line” application, which can present a hurdle for non-developers. But there is an easy-to-use online version called jq Play.

Although jq was designed for filtering ordinary plain JSON, it can also be used for JSON-LD. Just paste your JSON-LD statement in jq Play, and add a filter.

Let’s look at some simple filters that can identify important information in JSON-LD statements.

The first filter can tell us what properties are mentioned in the metadata. We can find that out using the “keys” filter. Type keys and you will get a list of properties at the highest level of the tree. Some of these have an @ symbol, indicating they are structural properties (for example "@context", "@id", "@type"). Don’t worry about those for now. Others will resemble ordinary words and be more understandable, for example, “contactPoint”, “logo”, “name”, “sameAs”, and “url”. These keys, from Apple’s Organization statement, tell us the kinds of information Apple includes about itself on its website.

[Image: JSON-LD statements on Apple.com]
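To make this concrete, here is a made-up Organization statement modeled on the keys listed above (it is not Apple’s actual markup), followed by the result of running the keys filter with the command-line version of jq. Pasting the same statement into jq Play with the filter keys gives the same list.

    $ cat organization.json
    {
      "@context": "https://schema.org",
      "@id": "https://www.example.com/#organization",
      "@type": "Organization",
      "name": "Example Corp",
      "url": "https://www.example.com",
      "logo": "https://www.example.com/logo.png",
      "sameAs": ["https://twitter.com/example"],
      "contactPoint": {
        "@type": "ContactPoint",
        "telephone": "+1-800-555-0100",
        "contactType": "customer support"
      }
    }

    $ jq 'keys' organization.json
    [
      "@context",
      "@id",
      "@type",
      "contactPoint",
      "logo",
      "name",
      "sameAs",
      "url"
    ]

Note that keys returns the property names in sorted order, which is handy when comparing statements across pages.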

Let’s suppose we have JSON-LD for an event. An event has many different kinds of entities associated with it, such as a location, the event’s name, and the performer. It would be nice to know what entities are mentioned in the metadata. All kinds of entities use a common property: name. Filtering on the name property can let us know what entities are mentioned in the metadata.

Using jq, we can find the entities by using the filter ..|.name?, which produces a list of names. When applied to a JSON-LD code sample from the schema.org website, we get the names associated with the Event: the name of the orchestra, the auditorium, the conductor, and the two symphonic works.

The filter was constructed using the pattern ..|.foo? (foo is a placeholder for whatever property you want to filter on). JSON-LD stores information in a tree that may be deeply nested: entities can refer to other entities. The .. operator lets the filter move through the tree, looking for potential matches at every level.

[Image: Results from jq play when filtering by name]
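Here is the same idea applied to a smaller, made-up event statement (not the schema.org sample itself). One wrinkle worth knowing: any object that lacks a name, like the Offer below, shows up as null, and those nulls can be dropped by adding select(. != null) to the filter.

    $ cat event.json
    {
      "@context": "https://schema.org",
      "@type": "MusicEvent",
      "name": "Spring Gala Concert",
      "location": {
        "@type": "MusicVenue",
        "name": "Example Concert Hall"
      },
      "performer": {
        "@type": "MusicGroup",
        "name": "Example Symphony Orchestra"
      },
      "workPerformed": {
        "@type": "CreativeWork",
        "name": "Symphony No. 5"
      },
      "offers": {
        "@type": "Offer",
        "price": "40",
        "priceCurrency": "USD"
      }
    }

    $ jq '..|.name?' event.json
    "Spring Gala Concert"
    "Example Concert Hall"
    "Example Symphony Orchestra"
    "Symphony No. 5"
    null

    $ jq '..|.name? | select(. != null)' event.json
    "Spring Gala Concert"
    "Example Concert Hall"
    "Example Symphony Orchestra"
    "Symphony No. 5"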

Finally, let’s make use of the structural information encoded with the @ symbol. Because lots of different entities have names, we also want to know what type of entity something is. Is “Chicago Symphony” the name of a symphonic work, or the name of an orchestra? In JSON-LD, the type of entity is indicated with the @type property. We can use jq to find what types of entities are included in the metadata. To do this, the filter would be ..|."@type"? . It follows the same ..|.foo? pattern, except that structural properties with an @ prefix need to be within quotes, because ordinary JSON doesn’t use the @ prefix and jq doesn’t recognize it unless it’s quoted.

When we use this filter for an Event, we learn that the statement covers the following types of entities:

  • “MusicEvent”
  • “MusicVenue”
  • “Offer”
  • “MusicGroup”
  • “Person”
  • “CreativeWork”

That one simple query reveals a lot about what is included. We can confirm that the star of the show (type Person) is included in the metadata. If not, we know to add the name of the conductor.
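On the smaller made-up event shown earlier (which has no conductor in its markup), the same kind of filter makes such a gap visible: no Person appears in the output. jq’s unique builtin can also collapse the results to one entry per type.

    $ jq '..|."@type"?' event.json
    "MusicEvent"
    "MusicVenue"
    "MusicGroup"
    "CreativeWork"
    "Offer"

    $ jq '[..|."@type"?] | unique' event.json
    [
      "CreativeWork",
      "MusicEvent",
      "MusicGroup",
      "MusicVenue",
      "Offer"
    ]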

Explore Further

I’m unable here to go into the details of how JSON-LD and schema.org metadata statements are constructed — though I do cover these basics in my book. To use jq in an audit, you will need some basic knowledge of important schema.org entities and properties, and know how JSON-LD creates objects (the curly braces) and lists (the brackets). If you don’t know these things yet, they can be learned easily.

The patterns in jq can be sophisticated, but at times, they can be fussy to wrangle. JSON-LD statements are frequently richer and more complex than simple statements in plain JSON. If you want to extract some specific information within JSON-LD, don’t hesitate to ask a friendly developer to help you set up a filter. Once you have the pattern, you can reuse it to retrieve similar information.

JSON-LD is still fairly new. Hopefully, purpose-built tools will emerge to help with auditing JSON-LD metadata. Until then, jq provides a lightweight option for exploring JSON-LD statements.

— Michael Andrews