The Lazy Person’s Guide to Text Wrangling

Writing in plain text is increasingly popular. Writers of all kinds are adopting Markdown, the plain text writing format. They swear by the zen-like benefits of writing in plain text. WYSIWYG isn’t cool any more. Yet little attention has been given to how to work with plain text when editing material originating from many different sources.  The growing popularity of plain text opens new opportunities for the creative reuse of text content, because plain text is inherently portable, able to move between different applications easily. This post will describe how to edit plain text content when different sources provide raw text that is used to develop new content.

Reusing and repurposing text is helpful in many situations.  For a personal side project, I’ve been exploring content design options, using representative content, for a prototype.  I want to reuse public information from different sources and in different formats (e.g., lists, tables, descriptive paragraphs). This content offers possibilities to combine and remix information to highlight specific themes. Even if I were a fast and accurate typist, retyping large volumes of text that already exists is not desirable or feasible.  Cutting and pasting text is tedious, especially if the text is formatted.  I wanted better tools to manipulate text encoded in different formats, and change the structure of the content.

Digital text is generally not plain text.  Digital formats can, however, be converted to some flavor of plain text.  Content designers may acquire digital text that exists as HTML, as CSV (a barebones spreadsheet), and even as PDF.  Each format involves implicit forms of structure, at various levels of granularity.  CSV assumes tabular information.  Plain text assumes linear information.  HTML can describe text content that contains different levels of information:

  • Names, words, dates, numbers and other discrete strings
  • Phrases that combine several strings together
  • Sentences
  • Paragraphs
  • Headings and other structural elements

We can edit and fine-tune all these levels using plain text.  Text wrangling makes it possible.

What is Text Wrangling?

Text wrangling converts and transforms information at different levels of granularity.  For example, information in a list could be converted into a table, or vice versa.  Text wrangling can restructure and renarrate information. It can also clean up content from different sources, such as standardizing spelling or wording.

Word processors (Word, iA Writer, Scrivener, etc.) are designed for people writing fresh content.  They aren’t designed to support the reuse and repurposing of existing content.  When trying to manipulate text, word processors are rather clumsy.  They fall short when you want to:

  • Generate content variations
  • Explore alternative wordings
  • Make multiple changes simultaneously
  • Ingest content from different sources that may be in different formats
  • Clean up text acquired from different sources

Text wrangling differs from normal editing.  Instead of editing a single document, text wrangling involves gathering text from many sources, and rewriting and consolidating that text into a unified document or content repository.  Text wrangling applies large scale changes to text, by automating some low level transformations.  It uses functionality available in different applications to reduce typing and cut-and-paste operations.  This editing occurs during a “pre-drafting” phase, before the text evolves into a readable “draft”.  Editors can wrangle “raw” text fragments, to define themes and structures, and unify editorial consistency.

Tools for Text Wrangling

Many applications have useful features to wrangle plain text.   Ironically, none of these applications was designed for writers; most were designed for coders or data geeks.  As writing in plain text becomes more popular (in Markdown, Textile, AsciiDoc, or reStructuredText), more people are using coding tools to write.  These tools have enhanced editing features that word processors lack, and that are especially absent from the many “distraction free” apps designed for writing in Markdown.

Because none of the wrangling applications was designed specifically for text prose, no one application does everything I want. I use a combination of tools, and switch between them, depending on which is easiest to use for a specific purpose.  That sounds complicated, but it isn’t.  Plain text can be opened in many applications, and can be copied easily between them.

I use three kinds of tools to rework text:

  1. Spreadsheets (Google Sheets, Excel)
  2. Text editors that are primarily designed for coders (TextWrangler, Brackets, Sublime Text)
  3. Global utilities that are available to use within any application (TextSoap, Paste).

Before I share some tips, a few caveats (remember, I promised a lazy approach). These tips are not a comprehensive review of available apps and functionality.  Other apps provide alternative approaches, and some will be unknown to me. Because I use a Mac, my experience is limited to that platform.  My preferences are motivated by a desire to find an easy way to perform a text task, without needing to learn anything fiddly.  Most developers would use regular expressions (regex) to clean text, which is a powerful option for those comfortable with scripting.  I’ve opted for a quick and dirty approach, even if it is occasionally a messy one.  Lastly, apologies to my tech writer friends for my bloggerly presentation: I’m assuming everyone can locate more specific instructions elsewhere.   You’ll learn more from a Google search than I can provide you through a single link.

General Approach to Wrangling Plain Text

The basic approach to text wrangling is to work with lines of text.  Text editors organize text by line, as do spreadsheets (which call them rows). To take advantage of the functionality these tools offer, we convert text into lines.  Everything, whether a word, a sentence, a paragraph or a heading, can become a line that can be transformed.  Lines can be split, and they can be joined, to form different levels of meaning.

All text can be considered as a line, which can be transformed into different structures

Some lines of text are just words or phrases — these lines are lists.  Some lines are complete sentences.  Working with sentences on separate lines is flexible.  It is easier to perform operations on sentences, such as changing the capitalization, when each sentence is on a separate line.  It is also easier to reorder sentences.  When finished with editing, individual sentences can be joined together into paragraphs.  Brackets has a “Join Lines” function (an extension) that makes it easy.  Just highlight all the lines you want joined together, and they become a single line of several sentences forming a paragraph.
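For readers who are comfortable with a little scripting, the split-and-join idea can be sketched in Python. This is a minimal illustration, not how Brackets implements “Join Lines”: the function names are made up for this example, and the sentence splitter is deliberately naive (it would break on an abbreviation such as “B.” in the middle of a name).

```python
import re


def split_sentences(paragraph):
    """Split a paragraph into sentences, one per list item.

    Naive rule (an assumption for this sketch): a sentence ends at
    '.', '!' or '?' followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def join_lines(lines):
    """Join non-empty lines into one paragraph, separated by single spaces."""
    return " ".join(line.strip() for line in lines if line.strip())


paragraph = "Plain text is portable. It opens anywhere.  Is that useful? Yes!"
sentences = split_sentences(paragraph)

# Reorder the sentences, then rebuild the paragraph.
reordered = sentences[2:] + sentences[:2]
rebuilt = join_lines(reordered)

print("\n".join(sentences))
print(rebuilt)
```

Once each sentence is its own item, reordering or changing capitalization becomes an operation on a list rather than careful cutting and pasting.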

Stripping out HTML and other markup

The first task is to get the content into plain text.  Working in plain text makes it easy to focus on the text.   If you have text content that’s encoded in HTML, you’ll want to get it into plain text, without all the distracting markup.  Even if you are comfortable reading HTML, you’ll find CSS and JavaScript code that’s irrelevant to the text.

Sometimes you can get plain text by selecting the “Reader View” in your browser, and copying or emailing the text and saving it.  Alternatively, you may be able to acquire tabular or structured text on websites from within Google Sheets, using “ImportXML” or “ImportHTML”.  These functions take a moment to learn, but can be very helpful when you need to get a little bit of text from many different webpages.

When you are working with text files instead of live content, you want a way to directly clean the text without having to first view it in a browser.  Open the file in a text editor, highlight text, and use TextSoap to strip out the HTML markup.  TextSoap is a handy app that can clean text, and can be used with any Mac application.
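If you would rather script this step, Python’s standard library can do a rough version of the same cleanup. The sketch below is an assumption about what “stripping markup” should mean here, not a description of how TextSoap works: it drops tags, skips the contents of script and style elements, and starts a new line at a handful of block-level tags. The class name, function name, and sample HTML are invented for illustration.

```python
from html.parser import HTMLParser

# Tags that should start a new line of text (a small, illustrative set).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Elements whose contents are code, not prose.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def html_to_lines(html):
    """Return the visible text of an HTML string as a list of clean lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Rainfall</h1><p>May was <b>wet</b> &amp; mild.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_lines(sample))
```

Real web pages are messier than this sample, so treat the output as raw material that still needs a read-through.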

Breaking Apart Phrases

Phrases are the foundation of sentences, labels, and headings.  If you need to wordsmith many words or phrases, it may be easiest to get these into a list, and work with them in a spreadsheet.  A list is essentially a one column spreadsheet.  By breaking apart the text in one column into different columns, you can modify different segments of the text, changing their order, standardizing wording and formatting, or extracting a sub-string of text within a longer string.  A common, simple example relates to people’s names.  Do you want to list an author’s name as a single phrase (“Ellen B. Smith”)? Or would you like to separate given name(s) and family name?  In what order do you want the given names and surnames?

Spreadsheets can split or extract the text in one column into one or more new columns.  You can create a new column for each distinct word, which allows you to group-edit distinct words.  Or you may want to extract and put into a new column what’s distinctive or unique in each line.

Google Sheets has a function called “Split” that breaks the text in a column into separate columns.  When words are in separate columns, it can be easy to make changes to specific words.  “Substitute” allows specific words to be swapped.  “Replace” allows a sub-string to be changed.
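The name example can also be sketched outside a spreadsheet. The Python below is only an illustration: it assumes the family name is the last word of the phrase, which is wrong for many real names (multi-word surnames, family-name-first conventions), and the function name and second author are made up.

```python
def split_name(full_name):
    """Split a name into (given names, family name).

    Assumption for this sketch: the family name is the final word.
    """
    parts = full_name.split()
    return " ".join(parts[:-1]), parts[-1]


authors = ["Ellen B. Smith", "Raj Patel"]

# One (given, family) pair per author, like two spreadsheet columns.
rows = [split_name(author) for author in authors]

# Recombine the columns in a different order.
surname_first = [f"{family}, {given}" for given, family in rows]

print(rows)
print(surname_first)
```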

Consolidating Phrases

Spreadsheets are good not just for breaking apart words, but for combining them.  You can take words in different columns on the same row and combine them together using “Join.”  “Concatenate” allows words from different columns and different rows to be combined.   This is an even more flexible option, because it lets you try unlimited combinations of words in different cells.  For example, you could play with different word hierarchies (broader or narrower words on different rows within a single column), or array a range of related verbs or adjectives across a single row in different columns.  “Concatenate” can enable simple sentence generation.
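As a rough scripted equivalent of simple sentence generation, the sketch below combines every subject, verb, and object from three short lists. The word lists are invented for illustration, and this is one possible approach rather than a reproduction of what the spreadsheet functions do.

```python
from itertools import product

# Hypothetical word lists, one per "column".
subjects = ["The app", "The service"]
verbs = ["tracks", "summarizes"]
objects = ["your spending"]

# Build one sentence for every combination of the three lists.
sentences = [
    f"{subject} {verb} {obj}."
    for subject, verb, obj in product(subjects, verbs, objects)
]

for sentence in sentences:
    print(sentence)
```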

A different situation occurs when you want to take information that’s within a table, and express it as a list.  A matrix table will have a column header, a row header, and a value associated with the column-row combination.  Suppose you have a table listing rainfall, with the columns representing months, and rows representing years, and the cell representing the amount of rainfall.  This information can be transformed into a single row or line.  In Excel, a function called “Unpivot” does this (Google Sheets lacks this functionality).  It presents all the information in a single row, such as “May | 2017 | 2 cm”, which can be joined together.  These values can be transformed further into a list of complete sentences, such as “In May 2017, the rainfall was 2 cm”.  That list of sentences could become the beginning of separate paragraphs that discuss the implications of each month’s rainfall on the local economy.
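The rainfall example can be sketched in a few lines of Python. The figures below are made up for illustration (only the “May 2017, 2 cm” value comes from the example above), and the function is a simplified stand-in for the idea of unpivoting, not Excel’s implementation.

```python
# Column headers (months) and one row of values per year.
months = ["April", "May"]
rainfall_cm = {
    2016: [5, 3],  # illustrative values
    2017: [4, 2],
}


def unpivot(column_headers, table):
    """Turn a matrix table into one (column, row, value) record per cell."""
    return [
        (column, row, value)
        for row, values in table.items()
        for column, value in zip(column_headers, values)
    ]


records = unpivot(months, rainfall_cm)

# Turn each record into a complete sentence.
sentences = [
    f"In {month} {year}, the rainfall was {cm} cm."
    for month, year, cm in records
]

for sentence in sentences:
    print(sentence)
```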

Removing Redundancy

It’s helpful to put each discrete idea in the text on a separate line.  These may be names of topics, phrases, or facts.  During the text development phase, you’ll want to collect all text strings of interest.  Text wrangling tools can help you collect everything of potential interest, and worry later about whether you’ve already covered these items.

Suppose you need content about all of your products.  You can create a list of all your products, with each product on a separate line.  If you have many product variations that sound similar, it can be hard to tell whether a product is already in the list.  If the list is a spreadsheet, it is easy to remove duplicates.  If the list is a text file, TextWrangler has a feature called “Kill Duplicates”.  To spot near-duplicates, sorting the lines will often reveal suspiciously similar items.
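A scripted version of the same cleanup might look like the sketch below. It treats lines as duplicates when they match after trimming spaces and ignoring case, which is a choice made for this example and not necessarily how TextWrangler or a spreadsheet decides. The product names are hypothetical.

```python
def remove_duplicates(lines):
    """Keep the first occurrence of each line, ignoring case and outer spaces."""
    seen = set()
    unique = []
    for line in lines:
        cleaned = line.strip()
        key = cleaned.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


products = [
    "Trail Pack 20",
    "trail pack 20",
    "Trail Pack 20L",
    "Summit Tent",
    "Trail Pack 20 ",
]

unique = remove_duplicates(products)

# Sorting puts near-duplicates next to each other for a manual check.
sorted_unique = sorted(unique, key=str.casefold)

print(unique)
print(sorted_unique)
```

Here “Trail Pack 20” and “Trail Pack 20L” survive as separate items, but sorting places them side by side so a person can decide whether they are the same product.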

Spotting duplicate or redundant paragraphs takes an extra step.  To compare alternative paragraphs, put them in separate files.  TextWrangler allows you to compare two files using the function “Find Differences”.  Both files are displayed side-by-side, with their differences highlighted.  This approach is more flexible than a word processor, which can assume you want to merge documents, or choose which document is the right one.
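For a scripted comparison, Python’s standard `difflib` module can report which lines differ between two versions. This is a minimal sketch with invented sample text and file names. It prints a list of changes rather than the side-by-side view described above.

```python
import difflib

# Two versions of the same passage, one sentence per line (illustrative text).
draft_a = ["The tent sleeps two people.", "It weighs 2 kg."]
draft_b = ["The tent sleeps two people.", "It weighs 1.8 kg."]

diff = list(
    difflib.unified_diff(
        draft_a,
        draft_b,
        fromfile="draft_a.txt",  # hypothetical file names, used as labels only
        tofile="draft_b.txt",
        lineterm="",
    )
)

# Keep only the changed lines, skipping the two file-name header lines.
changed = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

print("\n".join(diff))
```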

Harmonizing Style

A big task when assembling text from many sources is harmonizing the style.  Different texts may use different terminology.

Text editors, like word processors, have “Find and Replace” functionality, but they offer even better tools.  Find and Replace is inefficient because it assumes you have one word you already know you need to replace with another word.  Suppose instead you have many different words referring to the same concept?  Suppose you aren’t sure what would be the best replacement?  This is where the magic of “multiple selections” comes into play.

Brackets’ Multiple Selections feature can let you make many edits at once.  (Sublime Text has a similar feature).  All you need to do is highlight all the words that you want to change.  Then, you type over all the highlighted text at once, and see the changes happen as you type.  Words change on many lines at once, and you can try out different text to decide which works best across different sentences.  And to repeat: the highlighted words being changed don’t need to be the same word.  If you have some sentences that talk about dogs, some sentences that talk about canines, and some sentences that talk about mutts, you can highlight all these words (dogs, canines, mutts), and change them all to “hounds” — before deciding to say “man’s best friend” instead.
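The dogs, canines, and mutts example has a rough scripted counterpart, sketched below with a regular expression that matches any word in a list. It lacks the live, type-and-see feel of multiple selections, and the function name and sample sentences are made up for illustration.

```python
import re


def replace_terms(text, terms, replacement):
    """Replace every whole-word occurrence of any term with the replacement."""
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(rf"\b(?:{alternatives})\b")
    return pattern.sub(replacement, text)


lines = [
    "Most dogs enjoy walks.",
    "Some canines prefer naps.",
    "Our mutts like both.",
]
terms = ["dogs", "canines", "mutts"]

# Try one wording, then another, across all lines at once.
hounds = [replace_terms(line, terms, "hounds") for line in lines]
friends = [replace_terms(line, terms, "best friends") for line in lines]

print("\n".join(hounds))
print("\n".join(friends))
```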

Brackets has a related feature called “Multiple Cursors” that is also amazing.  It allows you to place your cursor on multiple lines, and edit multiple lines at once.  Suppose you want to decide the best construction for some headings.  You want to know if saying “{Product X} helps you {Benefit Y}” is best, or “{Product X} makes it possible to {Benefit Y}”.  You list all the products and their respective benefits on separate lines.  Then you can edit all the headings at once, and try out each variation to see which sounds best.
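The same comparison of heading constructions can be sketched with string templates. The product names and benefits below are hypothetical placeholders, and `{product}` and `{benefit}` stand in for “{Product X}” and “{Benefit Y}”.

```python
# Hypothetical products and their benefits, one pair per line.
pairs = [
    ("Acme Sync", "share files faster"),
    ("Acme Vault", "keep backups safe"),
]

# The two heading constructions being compared.
templates = [
    "{product} helps you {benefit}",
    "{product} makes it possible to {benefit}",
]

# Generate every heading for each construction.
variants = {
    template: [template.format(product=p, benefit=b) for p, b in pairs]
    for template in templates
}

for template, headings in variants.items():
    print(template)
    for heading in headings:
        print("  " + heading)
```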

Shifting Perspective

Many wording changes involve changing voice, or flipping emphasis.  Do you want to discuss a task using the imperative “invest” to emphasize action, or by using the gerund “investing” to emphasize a series of activities?  If you have many such tasks, you might want to put them in a list of statements, and try out both options.  You can then decide on a consistent approach.

In addition to the multiple cursor approach, you can edit multiple lines of text using the “Prefix/Suffix” functionality available in TextWrangler.  This allows you to either insert or remove either a prefix or a suffix to a line.  This could be useful with deciding on the wording of headings.  Maybe you want to see what the headings would sound like if they begin “Case Study: ” or whether they should end with “(Case Study)”.
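A prefix or suffix operation is simple to sketch in Python. The function names and sample headings below are invented for illustration and are not TextWrangler’s own commands.

```python
def add_prefix(lines, prefix):
    """Insert the prefix at the start of every line."""
    return [prefix + line for line in lines]


def add_suffix(lines, suffix):
    """Append the suffix to the end of every line."""
    return [line + suffix for line in lines]


def remove_prefix(lines, prefix):
    """Remove the prefix from each line that starts with it."""
    return [
        line[len(prefix):] if line.startswith(prefix) else line
        for line in lines
    ]


headings = ["Reducing churn", "Faster onboarding"]

prefixed = add_prefix(headings, "Case Study: ")
suffixed = add_suffix(headings, " (Case Study)")
restored = remove_prefix(prefixed, "Case Study: ")

print("\n".join(prefixed))
print("\n".join(suffixed))
```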

Skeleton Frameworks

Plain text tools can help you reuse text elements again and again.  This can be useful if you have a template or framework you are using to collect text.

Sublime Text has a feature called “Snippets”, where you can store any text you want to reuse, and inject it into any file you are working with.

Another option is a small utility called Paste, which works with any application on a Mac.  It is like a huge clipboard, where you can store large snippets of text, give these snippets names, and reuse them wherever you may need them.

Adding Markup to Plain Text

Plain text is great for writing and editing.  But eventually it will need some markup to become more useful.  Several options are available to turn plain text into web text.

Many writers have adopted Markdown. You can add Markdown syntax to the text, and convert the text to HTML.

You can also add basic HTML elements to plain text using TextSoap, which is a utility that can be used in any Mac application. You simply highlight the words you want to tag, and choose the HTML element you want to use. This option may be desirable if you need elements that aren’t well supported in Markdown.

The most robust option is to use the tagging functionality available in some text editors.  You can add markup using Brackets’s “Surround” extension, where you highlight your text, then define any tagging you want to place around the text.  Sublime Text has a similar feature: “Tag > Wrap Selection.”  These features let you add metadata beyond simple HTML elements; for example, to indicate the language of a phrase.
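Wrapping text in a tag can also be sketched as a small function. This is an illustration of the idea only, with a made-up function name and sample phrases. It uses the standard HTML `lang` attribute to mark the language of a phrase.

```python
from html import escape


def wrap(text, tag, **attributes):
    """Wrap text in an HTML element, escaping the text and attribute values."""
    attribute_text = "".join(
        f' {name}="{escape(value, quote=True)}"'
        for name, value in attributes.items()
    )
    return f"<{tag}{attribute_text}>{escape(text, quote=False)}</{tag}>"


phrase = wrap("joie de vivre", "span", lang="fr")
emphasis = wrap("R&D budget", "em")

print(phrase)
print(emphasis)
```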

Limitations of Text Wrangling

Text wrangling techniques can be handy in many situations, but will be inefficient in others.  They are intended for early content development work.  As I’ve discovered, they can be helpful for assembling text to prototype content.

These techniques aren’t efficient for editing massive content repositories, or editing single documents that aren’t very long.  If you need to migrate large volumes of content, you’ll want some custom scripts written to transform that content appropriately.

Text wrangling focuses on redrafting raw text, rather than collaboration, which will generally get delegated to another platform such as GitHub. Word processing apps offer better support for the review of well-defined drafts, where comments and change tracking are important.  If you are editing or rewriting individual documents, especially in collaboration with others, tools such as Google Docs that track comments will be a better option.

Someday, I hope someone will develop the perfect tool to edit text.  Until then, using a combination of tools is the best option.

—Michael Andrews