Author: Michael Andrews

Structural Metadata: Key to Structured Content

Post author By Michael Andrews
Post date October 11, 2017

Structural metadata is the most misunderstood form of metadata. It is widely ignored, even among those who work with metadata. When it is discussed, it gets confused with other things. Even people who understand structural metadata correctly don’t always appreciate its full potential. That’s unfortunate, because structural metadata can make content more powerful. This post takes a deep dive into what structural metadata is, what it does, and how it is changing.

Why should you care about structural metadata? The immediate, self-interested answer is that structural metadata facilitates content reuse, taking content that’s already created to deliver new content. Content reuse is nice for publishers, but it isn’t a big deal for audiences. Audiences don’t care how hard it is for the publisher to create their content. Audiences want content that matches their needs precisely, and that’s easy to use. Structural metadata can help with that too.

Structural metadata matches content with the needs of audiences. Content delivery can evolve beyond creating many variations of content — the current preoccupation of many publishers. Publishers can use structural metadata to deliver more interactive content experiences. Structural metadata will be pivotal in the development of multimodal content, allowing new forms of interaction, such as voice interaction. Well-described chunks of content are like well-described buttons, sliders and other forms of interactive web elements. The only difference is that they are more interesting. They have something to say.

Some of the following material will assume background knowledge about metadata. If you need more context, consult my very approachable book, Metadata Basics for Web Content.

What is Structural Metadata?

Structural metadata is data about the structure of content. In some ways it is not mysterious at all. Every time you write a paragraph, and enclose it within a
<p> paragraph element, you’ve created some structural metadata. But structural metadata entails far more than basic HTML tagging. It gives data to machines on how to deliver the content to audiences. When structural metadata is considered as a fancy name for HTML tagging, much of its potency gets missed.

The concept of structural metadata originated in the library and records management field around 20 years ago. To understand where structural metadata is heading, it pays to look at how it has been defined already.

In 1996, a metadata initiative known as the Warwick Framework first identified structural metadata as “data defining the logical components of complex or compound objects and how to access those components.”

In 2001, a group of archivists, who need to keep track of the relationships between different items of content, came up with a succinct definition: “Structural metadata can be thought of as the glue that binds compound objects together.”

By 2004, the National Information Standards Organization (NISO) was talking about structural metadata in their standards. According to their definition in the z39.18 standard, “Structural metadata explain the relationship between parts of multipart objects and enhance internal navigation. Such metadata include a table of contents or list of figures and tables.”

Louis Rosenfeld and Peter Morville introduced the concept of structural metadata to the web community in their popular book, Information Architecture for the World Wide Web — the “Polar Bear” book. Rosenfeld and Morville use the structural metadata concept as a prompt to define the information architecture of a websites:

“Describe the information hierarchy of this object. Is there a title? Are there discrete sections or chunks of content? Might users want to independently access these chunks?”

A big theme of all these definitions is the value of breaking content into parts. The bigger the content, the more it needs breaking down. The structural metadata for a book relates to its components: the table of contents, the chapters, parts, index and so on. It helps us understand what kinds of material is within the book, to access specific sections of the book, even if it doesn’t tell us all the specific things the book discusses. This is important information, which surprisingly, wasn’t captured when Google undertook their massive book digitization initiative a number of years ago. When the books were scanned, entire books became one big file, like a PDF. To find a specific figure or table within book on Google books requires searching or scrolling to navigate through the book.

Image of Google Books webpage. — The contents of scanned books in Google Books lack structural metadata, limiting the value of the content.

Navigation is an important purpose of structural metadata: to access specific content, such as a specific book chapter. But structural metadata has an even more important purpose than making big content more manageable. It can unbundle the content, so that the content doesn’t need to stay together. People don’t want to start with the whole book and then navigate through it to get to a small part in which they are interested. They want only that part.

In his recent book Metadata, Richard Gartner touches on a more current role for structural metadata: “it defines structures that bring together simpler components into something larger that has meaning to a user.” He adds that such information “builds links between small pieces of data to assemble them into a more complex object.”

In web content, structural metadata plays an important role assembling content. When content is unbundled, it can be rebundled in various ways. Structural metadata identifies the components within content types. It indicates role of the content, such as whether the content is an introduction or a summary.

Structural metadata plays a different role today than it did in the past, when the assumption was that there was one fixed piece of large content that would be broken into smaller parts, identified by structural metadata. Today, we may compose many larger content items, leveraging structural metadata, from smaller parts.

The idea of assembling content from smaller parts has been promoted in particular by DITA evangelists such as Anne Rockley (DITA is a widely used framework for technical documentation). Rockley uses the phrase “semantic structures” to refer to structural metadata, which she says “enable(s) us to understand ‘what’ types of content are contained within the documents and other content types we create.” Rockley’s discussion helpfully makes reference to content types, which some other definitions don’t explicitly mention. She also introduces another concept with a similar sounding name, “semantically rich” content, to refer to a different kind of metadata: descriptive metadata. In XML (which is used to represent DITA), the term semantic is used generically for any element. Yet the difference between structural and descriptive metadata is significant — though it is often obscured, especially in the XML syntax.

Curiously, semantic web developments haven’t focused much on structural metadata for content (though I see a few indications that this is starting to change). Never assume that when someone talks about making content semantic, they are talking about adding structural metadata.

Don’t Confuse Structural and Descriptive Metadata

When information professionals refer to metadata, most often they are talking about descriptive metadata concerning people, places, things, and events. Descriptive metadata indicates the key information included within the content. It typically describes the subject matter of the content, and is sometimes detailed and extensive. It helps one discover what the content is about, prior to viewing the content. Traditionally, descriptive metadata was about creating an external index — a proxy — such as assigning a keywords or subject headings about the content. Over the past 20 years, descriptive metadata has evolved to describing the body of the content in detail, noting entities and their properties.

Richard Gartner refers to descriptive metadata as “finding metadata”: it locates content that contains some specific information. In modern web technology, it means finding values for a specific field (or property). These values are part of the content, rather than separate from it. For example, find smartphones with dual SIMs that are under $400. The attributes of SIM capacity and price are descriptive metadata related to the content describing the smartphones.

Structural metadata indicates how people and machines can use the content. If people see a link indicating a slideshow, they have an expectation of how such content will behave, and will decide if that’s the sort of content they are interested in. If a machine sees that the content is a table, it uses that knowledge to format the content appropriately on a smartphone, so that all the columns are visible. Machines rely extensively on structural metadata when stitching together different content components into a larger content item.

diagram showing relationship of structural and descriptive metadata — Structural and descriptive metadata can be indicated in the same HTML tag. This tag indicates the start of an introductory section discussing Albert Einstein.

Structural metadata sometimes is confused with descriptive metadata because many people use vague terms such as “structure” and “semantics” when discussing content. Some people erroneously believe that structuring content makes the content “semantic”. Part of this confusion derives from having an XML-orientation toward content. XML tags content with angle-bracketed elements. But XML elements can be either structures such as sections, or they can be descriptions such as names. Unlike HTML, where elements signify content structure while descriptions are indicated in attributes, the XML syntax creates a monster hierarchical tree, where content with all kinds of roles are nested within elements. The motley, unpredictable use of elements in XML is a major reason it is unpopular with developers, who have trouble seeing what roles different parts of the content have.

The buzzword “semantically structured content” is particularly unhelpful, as it conflates two different ideas together: semantics, or what content means, with structure, or how content fits together. The semantics of the content is indicated by descriptive metadata, while the structure of the content is indicated by structural metadata. Descriptive metadata can focus on a small detail in the content, such as a name or concept (e.g., here’s a mention of the Federal Reserve Board chair in this article). Structural metadata, in contrast, generally focuses on a bigger chunk of content: here’s a table, here’s a sidebar. To assemble content, machines need to distinguish what the specific content means, from what the structure of the content means.

Interest in content modeling has grown recently, spurred by the desire to reuse content in different contexts. Unfortunately, most content models I’ve seen don’t address metadata at all; they just assume that the content can be pieced together. The models almost never distinguish between the properties of different entities (descriptive metadata), and the properties of different content types (structural metadata). This can lead to confusion. For example, a place has an address, and that address can be used in many kinds of content. You may have specific content types dedicated to discussing places (perhaps tourist destinations) and want to include address information. Alternatively, you may need to include the address information in content types that are focused on other purposes, such as a membership list. Unless you make a clear distinction in the content model between what’s descriptive metadata about entities, and what’s structural metadata about content types, many people will be inclined to think there is a one-to-one correspondence between entities and content types, for example, all addresses belong the the content type discussing tourist destinations.

Structural metadata isn’t merely a technical issue to hand off to a developer. Everyone on a content team who is involved with defining what content gets delivered to audiences, needs to jointly define what structural metadata to include in the content.

Three More Reasons Structural Metadata Gets Ignored…

Content strategists have inherited frameworks for working with metadata from librarians, database experts and developers. None of those roles involves creating content, and their perspective of content is an external one, rather than an internal one. These hand-me-down concepts don’t fit the needs of online content creators and publishers very well. It’s important not to be misled by legacy ideas about structural metadata that were developed by people who aren’t content creators and publishers. Structural metadata gets sidelined when people fail to focus on the value that content parts can contribute in different scenarios.

Reason 1: Focus on Whole Object Metadata

Librarians have given little attention to structural metadata, because they’ve been most concerned with cataloging and locating things that have well defined boundaries, such as books and articles (and most recently, webpages). Discussion of structural metadata in library science literature is sparse compared with discussions of descriptive and administrative metadata.

Until recently, structural metadata has focused on identifying parts within a whole. Metadata specialists assumed that a complete content item existed (a book or document), and that structural metadata would be used to locate parts within the content. Specifying structural metadata was part of cataloging existing materials. But given the availability of free text searching and more recently natural language processing, many developers question the necessity of adding metadata to sub-divide a document. Coding structural metadata seemed like a luxury, and got ignored.

In today’s web, content exists as fragments that can be assembled in various ways. A document or other content type is a virtual construct, awaiting components. The structural metadata forms part of the plan for how the content can fit together. It’s important to define the pieces first.

Reason 2: Confusion with Metadata Schemas

I’ve recently seen several cases where content strategists and others mix up the concept of structural metadata, with the concept of metadata structure, better known as metadata schemas. At first I thought this confusion was simply the result of similar sounding terms. But I’ve come to realize that some database experts refer to structural metadata in a different way than it is being used by librarians, information architects, and content engineers. Some content strategists seem to have picked up this alternative meaning, and repeat it.

Compared to semi-structured web content, databases are highly regular in structure. They are composed of tables of rows and columns. The first column of a row typically identifies what the values relate to. Some database admins refer to those keys or properties as the structure of the data, or the structural metadata. For example, the OECD, the international statistical organization, says: “Structural metadata refers to metadata that act as identifiers and descriptors of the data. Structural metadata are needed to identify, use, and process data matrixes and data cubes.” What is actually being referred to is the schema of the data table.

Database architects develop many custom schemas to organize their data in tables. Those schemas are very different from the standards-based structural metadata used in content. Database tables provide little guidance on how content should be structured. Content teams shouldn’t rely on a database expert to guide them on how to structure their content.

Reason 3: Treated as Ordinary Code

Web content management systems are essentially big databases built in programming language like PHP or .Net. There’s a proclivity among developers to treat chunks of content as custom variables. As one developer noted when discussing WordPress: “In WordPress (WP), the meaning of Metadata is a bit fuzzier. It stores post metadata such as custom fields and additional metadata added via plugins.”

As I’ve noted elsewhere, many IT systems that manage content ignore web metadata standards, resulting in silos of content that can’t work together. It’s not acceptable to define chunks of content as custom variables. The purpose of structural metadata is to allow different chunks of content to connect with each other. CMSs need to rely on web standards for their structural metadata.

Current Practices for Structural Metadata

For machines to piece together content components into a coherent whole, they need to know the standards for the structural metadata.

Until recently, structural metadata has been indicated only during the prepublication phase, an internal operation where standards were less important. Structural metadata was marked up in XML together with other kinds of metadata, and transformed into HTML or PDF. Yet a study in the journal Semantic Web last year noted: “Unfortunately, the number of distinct vocabularies adopted by publishers to describe these requirements is quite large, expressed in bespoke document type definitions (DTDs). There is thus a need to integrate these different languages into a single, unifying framework that may be used for all content.”

XML continues to be used in many situations. But a recent trend has been to adopt more light weight approaches, using HTML, to publish content directly. Bypassing XML is often simpler, though the plainness of HTML creates some issues as well.

As Jeff Eaton has noted, getting specific about the structure of content using HTML elements is not always easy:

“We have workhorse elements like ul, div, and span; precision tools like cite, table, and figure; and new HTML5 container elements like section, aside, and nav. But unless our content is really as simple as an unattributed block quote or a floated image, we still need layers of nested elements and CSS classes to capture what we really mean.”

Because HTML elements are not very specific, publishers often don’t know how to represent structural metadata within HTML. We can learn from the experience of publishers who have used XML to indicate structure, and who are adapting their structures to HTML.

Scientific research, and technical documentation are two genres where content structure is well-established, and structural metadata is mature. Both these genres have explored how to indicate the structure of their content in HTML.

Scientific research papers are a distinct content type that follows a regular pattern. The National Library of Medicine’s Journal Article Tag Suite (JATS) formalizes the research paper structure into a content type as an XML schema. It provides a mixture of structural and descriptive metadata tags that are used to publish biomedical and other scientific research. The structure might look like:

<sec sec-type="intro">

<sec sec-type="materials|methods">

<sec sec-type="results">

<sec sec-type="discussion">

<sec sec-type="conclusions">

<sec sec-type="supplementary-material" ... >

Scholarly HTML is an initiative to translate the typical sections of a research paper into common HTML. It uses HTML elements, and supplements them with typeof attributes to indicate more specifically the role of each section. Here’s an example of some attribute values in their namespace, noted by the prefix “sa”:

<section typeof="sa:MaterialsAndMethods">

<section typeof="sa:Results">

<section typeof="sa:Conclusion">

<section typeof="sa:Acknowledgements">

<section typeof="sa:ReferenceList">

As we can see, these sections overlap with the JATS, since both are describing similar content structures. The Scholarly HTML initiative is still under development, and it could eventually become a part of the schema.org effort.

DITA — the technical documentation architecture mentioned earlier — is a structural metadata framework that embeds some descriptive metadata. DITA structures topics, which can be different information types: Task, Concept, Reference, Glossary Entry, or Troubleshooting, for example. Each type is broken into structural elements, such as title, short description, prolog, body, and related links. DITA is defined in XML, and uses many idiosyncratic tags.

HDITA is a draft syntax to express DITA in HTML. It converts DITA-specific elements into HTML attributes, using the custom data-* attribute. For example a “key definition” element <keydef> becomes an attribute within an HTML element, e.g. <div data-hd-class="keydef”>
. Types are expressed with the attribute data-hd-type.

The use of the data-* offers some advantages, such as javascript access by clients. It is not, however, intended for use as a cross-publisher metadata standard. The W3C notes: “A custom data attribute is an attribute in no namespace…intended to store custom data private to the page or application.” It adds:

“These attributes are not intended for use by software that is not known to the administrators of the site that uses the attributes. For generic extensions that are to be used by multiple independent tools, either this specification should be extended to provide the feature explicitly, or a technology like microdata should be used (with a standardized vocabulary).”

The HDITA drafting committee appears to use “hd” in the data attribute to signify that the attribute is specific to HDITA. But they have not declared a namespace for these attributes (the XML namespace for DITA is xmlns:ditaarch.) This will prevent automatic machine discovery of the metadata by Google or other parties.

The Future of Structural Metadata

Most recently, several initiatives have explored possibilities for extending structural metadata in HTML. These revolve around three distinct approaches:

Formalizing structural metadata as properties
Using WAI-ARIA to indicate structure
Combining class attributes with other metadata schemas

New Vocabularies for Structures

The web standards community is starting to show more interest in structural metadata. Earlier this year, the W3C released the Web Annotation Vocabulary. It provides properties to indicate comments about content. Comments are an important structure in web content that are used in many genres and scenarios. Imagine that readers may be highlighting passages of text. For such annotations to be captured, there must be a way to indicate what part of the text is being referenced. The annotation vocabulary can reference specific HTML elements and even CSS selectors within a body of text.

Outside of the W3C, a European academic group has developed the Document Components Ontology (DoCO), “a general-purpose structured vocabulary of document elements.” It is a detailed set of properties for describing common structural features of text content. The DoCO vocabulary can be used by anyone, though its initial adoption will likely be limited to research-oriented publishers. However, many specialized vocabularies such as this one have become extensions to schema.org. If DoCO were in some form adsorbed by schema.org, its usage would increase dramatically.

Diagram showing document ontology — Diagram showing document components ontology

WAI-ARIA

WAI-ARIA is commonly thought of as a means to make functionality accessible. However, it should be considered more broadly as a means to enhance the functionality of web content overall, since it helps web agents understand the intentions of the content. WAI-ARIA can indicate many dynamic content structures, such as alerts, feeds, marquees, and regions.

The new Digital Publishing WAI-ARIA developed out of the ePub standards, which have a richer set of structural metadata than is available in standard HTML5. The goal of the Digital Publishing WAI-ARIA is to “produce structural semantic extensions to accommodate the digital publishing industry”. It has the following structural attributes:

doc-abstract
doc-acknowledgments
doc-afterword
doc-appendix
doc-backlink
doc-biblioentry
doc-bibliography
doc-biblioref
doc-chapter
doc-colophon
doc-conclusion
doc-cover
doc-credit
doc-credits
doc-dedication
doc-endnote
doc-endnotes
doc-epigraph
doc-epilogue
doc-errata
doc-example
doc-footnote
doc-foreword
doc-glossary
doc-glossref
doc-index
doc-introduction
doc-noteref
doc-notice
doc-pagebreak
doc-pagelist
doc-part
doc-preface
doc-prologue
doc-pullquote
doc-qna
doc-subtitle
doc-tip
doc-toc

To indicate an the structure of a text box showing an example:

<aside role="doc-example">

<h1>An Example of Structural Metadata in WAI-ARIA</h1>

…

</aside>

Content expressing a warning might look like this:

<div role="doc-notice" aria-label="Explosion Risk">

<p><em>Danger!</em> Mixing reactive materials may cause an explosion.</p>

</div>

Although book-focused, DOC-ARIA roles provide a rich set of structural elements that can be used with many kinds of content. In combination with the core WAI-ARIA, these attributes can describe the structure of web content in extensive detail.

CSS as Structure

For a long while, developers have been creating pseudo structures using CSS, such as making infoboxes to enclose certain information. Class is a global attribute of HTML, but has become closely associated with CSS, so much so that some believe that is its only purpose. Yet Wikipedia notes: “The class attribute provides a way of classifying similar elements. This can be used for semantic purposes, or for presentation purposes.” Some developers use what are called “semantic classes” to indicate what content is about. The W3C advises when using the class attribute: “authors are encouraged to use values that describe the nature of the content, rather than values that describe the desired presentation of the content.”

Some developers claim that the class attribute should never be used to indicate the meaning of content within an element, because HTML elements will always make that clear. I agree that web content should never use the class attribute as a substitute for using a meaningful HTML element. But the class attribute can sometimes further refine the meaning of an HTML element. Its chief limitation is that class names involve private meanings. Yet if they are self-describing they can be useful.

Class attributes are useful for selecting content, but they operate outside of metadata standards. However, schema.org is proposing a property that will allow class values to be specified within schema.org metadata. This has potentially significant implications for extending the scope of structural metadata.

The motivating use case is as follows: “There is a need for authors and publishers to be able to easily call out portions of a Web page that are particularly appropriate for reading out aloud. Such read-aloud functionality may vary from speaking a short title and summary, to speaking a few key sections of a page; in some cases, it may amount to speaking most non-visual content on the page.”

The pending cssSelector property in schema.org can identify named portions of a web page. The class could be a structure such as a summary or a headline that would be more specific than an HTML element. The cssSelector has a companion property called xpath, which identifies HTML elements positionally, such as the paragraphs after h2 headings.

These features are not yet fully defined. In addition to indicating speakable content, the cssSelector can indicate parts of a web page. According to a Github discussion: “The ‘cssSelector’ (and ‘xpath’) property would be particularly useful on http://schema.org/WebPageElement to indicate the part(s) of a page matching the selector / xpath. Note that this isn’t ‘element’ in some formal XML sense, and that the selector might match multiple XML/HTML elements if it is a CSS class selector.” This could be useful selecting content targeted at specific devices.

The class attribute can identify structures within the web content, working together with entity-focused properties that describe specific data relating to the content. Both of these indicate content variables, but they deliver different benefits.

Entity-based (descriptive) metadata can be used for content variables about specific information. They will often serve as text or numeric variables. Use descriptive metadata variables when choosing what informational details to put in a message.

Structural metadata can be used phrase-based variables, indicating reusable components. Phrases can be either blocks (paragraphs or divs), or snippets (a span). Use structural metadata variables when choosing the wording to convey a message in a given scenario.

A final interesting point about cssSelector’s in schema.org. Like other properties in schema.org, these can be expressed either as inline markup in HTML (microdata) or as an external JSON-LD script. This gives developers the flexibility to choose whether to use coding libraries that are optimized for arrays (JSON-flavored), or ones focus on selectors. For too long, what metadata gets included has been influenced by developer preferences in coding libraries. The fact that CSS attributes can be expressed as JSON suggests that hurdle is being transcended.

Conclusion

Structural metadata is finally getting some love in the standards community, even though awareness of it remains low among developers. I hope that content teams will consider how they can use structural metadata to be more precise in indicating what their content does, so that it can be used flexibly in emerging scenarios such as voice interactions.

— Michael Andrews

Tags metadata

Storytelling

The Lazy Person’s Guide to Text Wrangling

Post author By Michael Andrews
Post date October 4, 2017

Writing in plain text is increasingly popular. Writers of all kinds are adopting Markdown, the plain text writing format. They swear by the zen-like benefits of writing in plain text. WYSIWYG isn’t cool any more. Yet little attention has been given to how to work with plain text when editing material originating from many different sources. The growing popularity of plain text opens new opportunities for the creative reuse of text content, because plain text is inherently portable, able to move between different applications easily. This post will describe how to edit plain text content when different sources provide raw text that is used to develop new content.

Reusing and repurposing text is helpful in many situations. For a personal side project, I’ve been exploring content design options, using representative content, for a prototype. I want to reuse public information from different sources and in different formats (e.g., lists, tables, descriptive paragraphs). This content offers possibilities to combine and remix information to highlight specific themes. Even if I were a fast and accurate typist, retyping large volumes of text that already exists is not desirable or feasible. Cutting and pasting text is tedious, especially if the text is formatted. I wanted better tools to manipulate text encoded in different formats, and change the structure of the content.

Digital text is generally not plain text. Digital formats can, however, be converted to some flavor of plain text. Content designers may acquire digital text that exists as HTML, as CSV (a barebones spreadsheet), and even as PDF. Each format involves implicit forms of structure, at various levels of granularity. CSV assumes tabular information. Plain text assumes linear information. HTML can describe text content that contains different levels of information:

Names, words, dates, numbers and other discrete strings
Phrases that combine several strings together
Sentences
Paragraphs
Headings and other structural elements

We can edit and fine tune all these levels using plain text. Text wrangling makes it possible.

What is Text Wrangling?

Text wrangling converts and transforms information at different levels of granularity. For example, information in a list could be converted into a table, or vice versa. Text wrangling can restructure and renarrate information. It can also clean up content from different sources, such as standardizing spelling or wording.

Word processors (Word, iA writer, Scrivener, etc.) are designed for people writing fresh content. They aren’t designed to support the reuse and repurposing of existing content. When trying to manipulate text, word processors are rather clumsy. Word processors fail when wanting to:

Generate content variations
Explore alternative wordings
Make multiple changes simultaneously
Ingest content from different sources that may be in different formats
Clean up text acquired from different sources.

Text wrangling differs from normal editing. Instead of editing a single document, text wrangling involves gathering text from many sources, and rewriting and consolidating that text into a unified document or content repository. Text wrangling applies large scale changes to text, by automating some low level transformations. It uses functionality available in different applications to reduce typing and cut-and-paste operations. This editing occurs during a “pre-drafting” phase, before the text evolves into a readable “draft”. Editors can wrangle “raw” text fragments, to define themes and structures, and unify editorial consistency.

Tools for Text Wrangling

Many applications have useful features to wrangle plain text. Ironically, none of these applications was designed for writers; most were designed for coders or data geeks. As writing in plain text becomes more popular (in Markdown, Textile, AsciiDoc, or reStructuredText), more people are using coding tools to write. These tools have enhanced editing features lacking in word processors, particularly the many “distraction free” apps designed for writing in Markdown.

Because none of the wrangling applications was designed specifically for text prose, no one application does everything I want. I use a combination of tools, and switch between them, depending on which is easiest to use for a specific purpose. That sounds complicated, but it isn’t. Plain text can be opened in many applications, and can be copied easily between them.

I use three kinds of tools to rework text:

Spreadsheets (Google Sheets, Excel)
Text editors that are primarily designed for coders (TextWrangler, Brackets, Sublime Text)
Global utilities that are available to use within any application (TextSoap, Paste).

Before I share some tips, a few caveats (remember, I promised a lazy approach). These tips are not a comprehensive review of available apps and functionality. Other apps provide alternative approaches, and some will be unknown to me. Because I use a Mac, my experience is limited to that platform My preferences are motivated by a desire to find an easy way to perform a text task, without needing to learn anything fiddly. Most developers would use something called Regex scripting to clean text, which is a powerful option for those comfortable with scripting. I’ve opted for a quick and dirty approach, even if it is occasionally a messy one. Lastly, apologies to my tech writer friends for my bloggerly presentation — I’m assuming everyone can locate more specific instructions elsewhere. You’ll learn more from a Google search than I can provide you through a single link.

General Approach to Wrangling Plain Text

The basic approach to text wrangling is to work with lines of text. Text editors organize text by line, as do spreadsheets (which call them rows). To take advantage of the functionality these tools offer, we convert text into lines. Everything, whether a word, a sentence, a paragraph or a heading, can become a line that can be transformed. Lines can be split, and they can be joined, to form different levels of meaning.

All text can be considered as a line, that can be transformed into different structures — All text can be considered as a line, which can be transformed into different structures

Some lines of text are just words or phrases — these lines are lists. Some lines are complete sentences. Working with sentences on separate lines is flexible. It is easier to perform operations on sentences, such as changing the capitalization, when each sentence is on a separate line. It is also easier to reorder sentences. When finished with editing, individual sentences can be joined together into paragraphs. Brackets has a “Join Lines” function (an extension) that makes it easy. Just highlight all the lines you want joined together, and they become a single line of several sentences forming a paragraph.

Stripping out HTML and other markup

The first task is to get the content into plain text. Working in plain text makes it easy to focus on the text. If you have text content that’s encoded in HTML, you’ll want to get it into plain text, without all the distracting markup. Even if you are comfortable reading HTML, you’ll find CSS and Javascript markup that’s irrelevant to the text.

Sometimes you can get plain text by selecting the “Reader View” in your browser, and copying or emailing the text and saving it. Alternatively, you may be able to acquire tabular or structured text on websites from within Google Sheets, using “ImportXML” or “ImportHTML”. These functions take a moment to learn, but can be very helpful when you need to get a little bit of text from many different webpages.

When you are working with text files instead of live content, you want a way to directly clean the text without having to first view it in a browser. Open the file in text editor, highlight text, and use TextSoap to strip out the HTML markup. TextSoap is a handy app that can clean text, and can be used with any Mac application.

Breaking Apart Phrases

Phrases are the foundation of sentences, labels, and headings. If you need to wordsmith many words or phrases, it may be easiest to get these into a list, and work with them in a spreadsheet. A list is essentially a one column spreadsheet. By breaking apart the text in one column into different columns, you can modify different segments of the text, changing their order, standardizing wording and formatting, or extracting a sub-string of text within a longer string. A common, simple example relates to people’s names. Do you want to list an author’s name as a single phrase (“Ellen B. Smith”)? Or would you like to separate given name(s) and family name? What order to you want the given names and surnames?

Spreadsheets can split or extract the text in one column into one or more new columns. You can create a new column for each distinct word, which allows you to group-edit distinct words. Or you may want to extract and put into a new column what’s distinctive or unique in each line.

Google Sheets has a function called “Split” that breaks the text in a column into separate columns. When words are in separate columns, it can be easy to make changes to specific words. “Substitute” allows specific words to be swapped. “Replace” allows a sub-string to be changed.

Consolidating Phrases

Spreadsheets are good not just for breaking apart words, but for combining them. You can take words in different columns on the same row and combine them together using “Join.” “Concatenate” allows words from different columns and different rows to be combined. This is an even more flexible option, because it lets you try unlimited combinations of words in different cells. For example, you could play with different word hierarchies (broader or narrower words on different rows within a single column), or array a range of related verbs or adjectives across a singe row in different columns. “Concatenate” can enable simple sentence generation.

A different situation occurs when you want to take information that’s within a table, and express it as a list. A matrix table will have a column header, a row header, and a value associated with the column-row combination. Suppose you have a table listing rainfall, with the columns representing months, and rows representing years, and the cell representing the amount of rainfall. This information can be transformed into a single row or line. In Excel, a function called “Unpivot” does this (Google Sheets lacks this functionality). It presents all the information in a single row, such as “May | 2017| 2 cm”, which can be joined together. These values can be transformed further into a list of complete sentences, such as “In May 2017, the rainfall was 2 cm”. That list of sentences could become the beginning of separate paragraphs that discuss the implications of each month’s rainfall on the local economy.

Removing Redundancy

It’s helpful to put each discrete idea in the text on a separate line. These may be names of topics, phrases, or facts. During the text development phase, you’ll want to collect all text strings of interest. Text wrangling tools can help you collect everything of potential interest, and worry later about whether you’ve already covered these items.

Suppose you need content about all of your products. You can create a list of all your products, with each product on a separate line. If you have many product variations that sound similar, it can be confusing to know if it’s already in the list. If the list is a spreadsheet, it is easy to remove duplicates. If the list is a text file, TextWrangler has a feature called “Kill Duplicates”. To spot near-duplicates, sorting the lines will often reveal suspiciously-similar items.

Spotting duplicate or redundant paragraphs takes an extra step. To compare alternative paragraphs, put them in separate files. TextWrangler allows you to compare two files using the function “Find Differences”. Both files are displayed side-by-side, with their differences highlighted. This approach is more flexible than a word processor, which can assume you want to merge documents, or choose which document is the right one.

Harmonizing Style

A big task when assembling text from many sources is harmonizing the style. Different texts may use different terminology.

Text editors, like word processors, have “Find and Replace” functionality, but they offer even better tools. Find and Replace is inefficient because it assumes you have one word you already know you need to replace with another word. Suppose instead you have many different words referring to the same concept? Suppose you aren’t sure what would be the best replacement? This is where the magic of “multiple selections” comes into play.

Brackets’ Multiple Selections feature can let you make many edits at once. (Sublime Text has a similar feature). All you need to do is highlight all the words that you want to change. Then, you type over all the highlighted text at once, and see the changes happen as you type. Words change on many lines at once, and you can try out different text to decide which works best across different sentences. And to repeat: the highlighted words being changed don’t need to be the same word. If you have some sentences that talk about dogs, some sentences that talk about canines, and some sentences that talk about mutts, you can highlight all these words (dogs, canines, mutts), and change them all to “hounds” — before deciding to say “man’s best friend” instead.

Brackets has a related feature called “Multiple Cursors” that is also amazing. It allows you to place your cursor on multiple lines, and edit multiple lines at once. Suppose you want to decide the best construction for some headings. You want to know if saying “{Product X} helps you {Benefit Y}” is best, or “{Product X} makes it possible to {Benefit Y}”. You list all the products and their respective benefits on separate lines. Then you can edit all the headings at once, and try out each variation to see which sounds best.

Shifting Perspective

Many wording changes involve changing voice, or flipping emphasis. Do you want to discuss a task using the imperative “invest” to emphasize action, or by using the gerund “investing” to emphasize a series of activities? If you have many such tasks, you might want to put them in a list of statements, and try out both options. You can then decide on a consistent approach.

In addition to the multiple cursor approach, you can edit multiple lines of text using the “Prefix/Suffix” functionality available in TextWrangler. This allows you to either insert or remove either a prefix or a suffix to a line. This could be useful with deciding on the wording of headings. Maybe you want to see what the headings would sound like if they begin “Case Study: ” or whether they should end with “(Case Study)”.

Skeleton Frameworks

Plain text tools can help you reuse text elements again and again. This can be useful if you have a template or framework you are using to collect text.

Sublime Text has a feature called “Snippets”, where you can store any text you want to reuse, and inject it into any file you are working with.

Another option is a small utility called Paste, which works with any application on a Mac. It is like a huge clipboard, where you can store large snippets of text, give these snippets names, and reuse them wherever you may need them.

Adding Markup to Plain Text

Plain text is great for writing and editing. But eventually it will need some markup to become more useful. Several options are available to turn plain text into web text.

Many writers have adopted Markdown. You can add Markdown syntax to the text, and convert the text to HTML.

You can also add basic HTML elements to plain text using TextSoap, which is a utility that can be used in any Mac application. You simply highlight the words you want to tag, and choose the HTML element you want to use. This option may be desirable if you need elements that aren’t well supported in Markdown.

The most robust option is to use the tagging functionality available in some text editors. You can add markup using Brackets’s “Surround” extension, where you highlight your text, then define any tagging you want to place around the text. Sublime Text has a similar feature: “Tag > Wrap Selection.” These features let you add metadata beyond simple HTML elements; for example, to indicate in what language a phrase is.

Limitations of Text Wrangling

Text wrangling techniques can be handy in many situations, but will be inefficient in others. They are intended for early content development work. As I’ve discovered, they can be helpful for assembling text to prototype content.

These techniques aren’t efficient for editing massive content repositories, or editing single documents that aren’t very long. If you need to migrate large volumes of content, you’ll want some custom scripts written to transform that content appropriately.

Text wrangling focuses on redrafting raw text, rather than collaboration, which will generally get delegated to another platform such as GitHub. Word processing apps offer better support for the review of well-defined drafts, where comments and change tracking are important. If you are editing or rewriting individual documents, especially in collaboration with others, tools such as Google Docs that track comments will be a better option.

Someday, I hope someone will develop the perfect tool to edit text. Until then, using a combination of tools is the best option.

—Michael Andrews

Tags text wrangling