XML, Latin, and the demise or endurance of languages

We are living in a period of great fluctuation and uncertainty.  In nearly every domain — whether politics, business, technology, or health policy — people are asking what is the foundation upon which the future will be built.  Even the very currency of language doesn’t seem solid.  We don’t know if everyone agrees what concepts mean anymore or what’s considered the source of truth.

Language provides a set of rules and terms that allow us to exchange information.  We can debate if the rules and terms are good ones — supporting expression.  But even more important is whether other groups understand how to use these rules and terms.  Ubiquity is more important than expressiveness because a rich language is not very useful if few people can understand it.

I used to live in Rome, the Eternal City.  When I walked around, I encountered Latin everywhere: it is carved on ancient ruins and Renaissance churches.  No one speaks Latin today, of course. Latin is a dead language.  Yet there’s also no escaping its legacy.  Latin was ubiquitous and is still found scattered around in many places, even though hardly anyone understands it today.  Widely used languages such as Latin may die off over time but they don’t suddenly disappear.  Slogans in Latin still appear on our public buildings and monetary currency.

I want to speculate about the future of the XML markup language and the extent to which it will be eternal.  It’s a topic that elicits diverging opinions, depending on where one sits.  XML is the foundation of several standards advocated by certain content professionals.  And XML is undergoing a transition: it’s lost popularity but is still present in many areas of content. What will be the future role of XML for everyday online content?  

In the past, discussions about XML could spark heated debates between its supporters and detractors.  A dozen years ago, for example, the web world debated the XHTML-2 proposal to make HTML compliant with XML. Because of its past divisiveness, discussions comparing XML to alternatives can still trigger defensiveness and wariness among some even now. But for most people, the role of XML today is not a major concern, apart from a small number of partisans who use XML either willingly or unwillingly.  Past debates about whether XML-based approaches are superior or inferior to alternatives are largely academic at this point. For the majority of people who work with web content, XML seems exotic: like a parallel universe that uses an unfamiliar language.   

Though only a minority of content professionals focus on XML now, everyone who deals with content structure should understand where XML is heading. XML continues to have an impact on many things in the background of content, including ways of thinking about content that are both good and bad.  It exerts a silent influence over how we think about content, even for those who don’t actively use it. The differences between XML and its alternatives are rarely discussed directly now.  They have been driven under the surface, out of view, by a tacit truce to “agree to disagree” and ignore alternatives.  That’s unfortunate, because it results in silos of viewpoints about content that are mutually contradictory.  I don’t believe choices about the structural languages that define communications should be matters of personal preference, because these choices have many consequences that affect all kinds of stakeholders in the near and long term.  Language, ultimately, is about being able to construct a common meaning between different parties.  That is something content folks should care about deeply, whatever their starting views.

XML today

Like Latin, XML has experienced growth and decline.  

XML started out promising to provide a universal language for the exchange of content.  It succeeded in its early days in becoming the standard for defining many kinds of content, some of which are still widely used.  A notable example is the Android platform, first released in 2008, which uses XML for screen layouts. But XML never succeeded in conquering the world by defining all content.  Despite impressive early momentum, XML has seemed less important with each passing year of the past decade.  Android’s screen layout was arguably the last major XML-defined initiative.

A small example of XML’s decline is the fate of RSS feeds.  RSS was one of the first XML formats for content and was instrumental in the expansion of the first wave of blogging.  However, over time, fewer and fewer blogs and websites actively promoted RSS feeds.  RSS is still used widely but has been eclipsed by other ways of distributing content.  Personally, I’m sorry to see RSS’s decline.  But I am powerless to change that.  Individuals must adapt to collectively driven decisions surrounding language use.

By 2010, XML could no longer credibly claim to be the future of content.  Web developers were rejecting XML on multiple fronts:

  • Interactive websites, using an approach then referred to as AJAX (the X standing for XML), stopped relying on XML and started using the more web-friendly data format known as JSON, designed to work with JavaScript, the most popular web programming language.
  • The newly-released HTML5 standard rejected XML compatibility.  
  • The RESTful API standard for content exchange started to take off, which embraced JSON over XML.  
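To make the contrast concrete, here is a minimal sketch of the same small record expressed in both formats and read with Python's standard library.  The record, its field names, and the helper functions are invented for illustration; they are not drawn from any real feed or API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical article record, invented for illustration only.
xml_payload = (
    "<article><title>XML and Latin</title>"
    "<tags><tag>xml</tag><tag>json</tag></tags></article>"
)
json_payload = '{"title": "XML and Latin", "tags": ["xml", "json"]}'


def article_from_xml(text):
    """Walk the XML tree and pull out each value by element name."""
    root = ET.fromstring(text)
    return {
        "title": root.findtext("title"),
        "tags": [tag.text for tag in root.findall("./tags/tag")],
    }


def article_from_json(text):
    """JSON maps directly onto dictionaries and lists, so no tree walking is needed."""
    return json.loads(text)


xml_article = article_from_xml(xml_payload)
json_article = article_from_json(json_payload)
print(xml_article == json_article)
```

Both routes arrive at the same data.  The difference is that the JSON version needs no knowledge of element names or nesting before it can be used, which is part of why web developers found it easier to work with.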

Around the same time, web content creators were getting more vocal about “the authoring experience” — criticizing technically cumbersome UIs and demanding more writer-friendly authoring environments.  Many web writers, who generally weren’t technical writers or developers, found XML’s approach difficult to understand and use.  They preferred simpler options such as WordPress and Markdown.  This shift was part of a wider trend where employees expect their enterprise applications to be as easy to use as their consumer apps. 

The momentum pushing XML into a steady decline had started.  It retreated from being a mainstream approach to becoming one used to support specialized tasks.  Its supporters maintained that while it may not be the only solution, it was still the superior one.  They hoped that eventually the rest of the world would recognize the unique value of what XML offered and adopt it, scrambling to play catch up.  

That faith in XML’s superiority continues among some.  At the Lavacon content strategy conference this year, I continued to hear speakers, who may have worked with XML for their entire careers, refer to XML as the basis of “intelligent content.”  Among people who work with XML, a common refrain is that XML makes content future-ready.  These characterizations imply that if you want to be smarter with content and make it machine-ready, it needs to be in XML.  The myth that XML is the foundation of the future has been around since its earliest days.  Take the now-obscure AI markup language, AIML, created in 2001, which was an attempt to encode “AI” in XML.  It ended up being one of many zombie XML standards that weren’t robust enough for modern implementations and thus weren’t widely used.  Given trends in XML usage, it seems likely that other less mainstream XML-centric standards and approaches will face a similar fate.  XML is not intrinsically superior to other approaches.  It is simply different, having both strengths and weaknesses.  Teleological explanations  — implying a grand historical purpose — tend to stress the synergies between various XML standards and tools that provide complementary building blocks supporting the future. Yet they can fail to consider the many factors that influence the adoption of specific languages.  

The AIML example highlights an important truth about formal IT languages: simply declaring them as a standard and as open-source does not mean the world is interested in using them.  XML-based languages are often promoted as standards, but their adoption is often quite limited.  De facto standards — ones that evolve through wide adoption rather than committee decisions — are often more important than “official” standards.  

What some content professionals who advocate XML seem to under-appreciate is how radically developments in web technologies have transformed the foundations of content.  XML became the language of choice for an earlier era in IT when big enterprise systems built in Java dominated.  XML became embedded in these systems and seemed to be at the center of everything.  But the era of big systems was different from today’s.  Big systems didn’t need to talk to each other often: they tried to manage everything themselves.  

The rise of the cloud (specifically, RESTful APIs) disrupted the era of big systems and precipitated their decline.  No longer were a few systems trying to manage everything.  Lots of systems were handling many activities in a decentralized manner.  Content needed to be able to talk easily to other systems. It needed to be broken down into small nuggets that could be quickly exchanged via an API.  XML wasn’t designed to be cloud-friendly, and it has struggled to adapt to the new paradigm. RESTful APIs depend on “easy, reliable and fast data exchanges,” something XML can’t offer.

A few Lavacon speakers candidly acknowledged the feeling that the XML content world is getting left behind.  The broader organizations in which they are employed (marketing, developers, and writers) aren’t buying into the vision of an XML-centric universe.

And the facts bear out the increasing marginalization of XML.  According to a study last year by Akamai, 83% of web traffic today is APIs and only 17% is browsers.  This reflects the rise of smartphones and other new devices and channels.  Of APIs, 69% use the JSON format, with HTML a distant second. “JSON traffic currently accounts for four times as much traffic as HTML.” And what about XML?   “XML traffic from applications has almost disappeared since 2014.”  XML is becoming invisible as a language to describe content on the internet.

Even those who love working with XML must have asked themselves: What happened?  Twenty years ago, XML was heralded as the future of the web.  To point out the limitations of XML today does not imply XML is not valuable.  At the same time, it is productive to reality-check triumphalist narratives of XML, which linger long after its eclipse.  Memes can have a long shelf life, detached from current realities.  

XML has not fallen out of favor because of any marketing failure or political power play.  Broader forces are at work. One way we can understand why XML has failed, and how it may survive, is by looking at the history of Latin.

Latin’s journey from universal language to a specialized vocabulary

Latin was once one of the world’s most widely-used languages.  At its height, it was spoken by people from northern Africa and western Asia to northern Europe.

The growth and decline of Latin provides insights into how languages, including IT-flavored ones such as XML, succeed and fail.  The success of a language depends on expressiveness and ubiquity.

Latin is a natural language that evolved over time, in contrast to XML, which is a formal language intentionally created to be unambiguous.  Both express ideas, but a natural language is more adaptive to changing needs.  Latin has a long history, transforming in numerous ways over the centuries.

In Latin’s early days during the Roman Republic, it was a widely-spoken vernacular language, but it wasn’t especially expressive.  If you wanted to write or talk about scientific concepts, you still needed to use Greek.  Eventually, Latin developed the words necessary to talk about scientific concepts, and the use of Greek by Romans diminished.  

The collapse of the Roman Empire corresponded to Latin’s decline as a widely-spoken vernacular language.  Latin was never truly monolithic, but without an empire imposing its use, the language fragmented into many different variations, or else was jettisoned altogether.  

In the Middle Ages, the Church had a monopoly on learning, ensuring that Latin continued to be important, even though it was not any person’s “native” language.  Latin had become a specialized language used for clerical and liturgical purposes.  The language itself changed, becoming more “scholastic” and more narrow in expression. 

By the Renaissance, Latin had morphed into a written language that wasn’t generally spoken. Although Latin’s overall influence on Europeans was still diminishing, it experienced a modest revival because legacy writings in Latin were being rediscovered.  It was important to understand Latin to uncover knowledge from the past, at least until that knowledge was translated into vernacular languages.  Latin was decidedly “unvernacular”: a rigid language of exchange.  Erasmus wrote in Latin because he wanted to reach readers in other countries, and using Latin was the best means to do that, even if the audience was small.  A letter written in Latin could be read by an educated person in Spain or Holland, even if those people would normally speak Spanish or Dutch.  Yet Galileo wrote in Italian, not Latin, because his patrons didn’t understand Latin.  Latin was an elite language, and over time the size of the elite who knew Latin became smaller.

Latin ultimately died because it could not adapt to changes in the concepts that people needed to express, especially concerning new discoveries, ideas, and innovations.

Latin has transitioned from being a complete language to becoming a controlled vocabulary.  Latin terms may be understood by doctors, lawyers, or botanists, but even these groups are being urged to use plain English to communicate with the public.  Only in communications among themselves do they use Latin terms, which can be less ambiguous than colloquial ones. 

Latin left an enduring legacy we rarely think about. It gave us the alphabet we use, allowing us to write text in most European languages as well as many other non-European ones.  

XML’s future

Much as the collapse of the Roman Empire triggered the slow decline of Latin, the disruption of big IT systems by APIs has triggered the long term decline of XML.  But XML won’t disappear suddenly, and it may even change shape as it tries to find niche roles in a cloud-dominated world.  

Robert Glushko’s book, The Discipline of Organizing, states: “‘The XML World’ would be another appropriate name for the document-processing world.”  XML is tightly fused to the concept of documents — which are increasingly irrelevant artifacts on the internet.  

The internet has been gradually and steadily killing off the ill-conceived concept of “online documents.”  People increasingly encounter and absorb screens that are dynamically assembled from data.  The content we read and interact with is composed of data. Very often there’s no tangible written document that provides the foundation for what people see.  People are seeing ghosts of documents: they are phantom objects on the web. Since few online readers understand how web screens are assembled, they project ideas about what they are seeing.  They tell themselves they are seeing “pages.” Or they reify online content as PDFs.  But these concepts are increasingly irrelevant to how people actually use digital content.  Like many physical things that have become virtual, the “online document” doesn’t really resemble the paper one.  Online documents are an unrecognized case of skeuomorphism.

None of this is to say that traditional documents are dead.  XML will maintain an important role in creating documents. What’s significant is that documents are returning to their roots: the medium of print (or equivalent offline digital formats).  XML originally was developed to solve desktop publishing problems.  Microsoft’s Word and PowerPoint formats are XML-compatible, as is Adobe’s PDF format. Both these firms are trying to make these “office” products escape the gravity-weight of the document and become more data-like.  But documents have never fit comfortably in an interactive, online world.  People often confuse the concepts of “digital” and “online”.  Everything online is digital, but not everything digital is online or meant to be.  A Word document is not fun to read online.  Most documents aren’t.  Think about the 20-page terms and conditions document you are asked to agree to.

A document is a special kind of content.  It’s a highly ordered, large content item.  Documents are linear, with a defined start and finish.  A book, for example, starts with a title page, provides a table of contents, and ends with an appendix and index.  Documents are offline artifacts.  They are records that are meant to be enduring and not change. Most online content, however, is impermanent and needs to change frequently. As content online has become increasingly dynamic, the need for maintaining consistent order has lessened as well.  Online content is accessed non-linearly.

XML promoted a false hope that the same content could be presented equally well both online and offline (specifically, in print).  But publishers have concluded that print and online are fundamentally different.  They can’t be equal priorities.  Either one or the other will end up driving the whole process.  For example, The Wall Street Journal, which has an older subscriber base, has given enormous attention to its print edition, even as other newspapers have de-emphasized or even dropped theirs.  In a review of its operations this past summer, the Journal found that its editorial processes were dominated by print because print is different.  Decisions about content are based on the layout needs of print, such as content length, article and image placement, as well as the differences in delivering a whole edition versus delivering a single article.  Print has been hindering the Journal’s online presence because it’s not possible to deliver the same content to print and screen as equally important experiences.  As a result, the Journal is contemplating de-emphasizing print, making it follow online decisions, rather than compete with them.

Some publishers have no choice but to create printable content.  XML will still enjoy a role in industrial-scale desktop publishing.  Pharmaceutical companies, for example, need to print labels and leaflets explaining their drugs.  The customer’s physical point of access to the product is critical to how it is used — potentially more important than any online information.  In these cases, the print content may be more important than the online content, driving the process for how online channels deliver the content.  Not many industries are in this situation and those that are can be at risk of becoming isolated from the mainstream of web developments.  

XML still has a role to play in the management of certain kinds of digital content.  Because XML is older and has a deeper legacy, it has been decidedly more expressive until recently.  Expressiveness relates to the ability to define concepts unambiguously.  People used to fault the JSON format for lacking a schema like XML has, though JSON now offers such a schema.  XML is still more robust in its ability to specify highly complex data structures, though in many cases alternatives exist that are compatible with JSON.   Document-centric sectors such as finance and pharmaceuticals, which have burdensome regulatory reporting requirements, remain heavy users of XML.  Big banks and other financial institutions, which are better known for their hesitancy than their agility, still use XML to exchange financial data with regulators. But the fast-growing FinTech sector is API-centric and is not XML-focused.  The key difference is the audience.  Big regulated firms are focused on the needs of a tightly knit group of stakeholders (suppliers, regulators, etc.) and prioritize the bulk exchange of data with these stakeholders.  Firms in more competitive industries, especially startups, are focused on delivering content to diverse customers, not bulk uploads.  

XML and content agility

The downside of expressiveness is heaviness.  XML has been criticized as verbose and heavy — much like Victorian literature.  Just as Dickensian prose has fallen out of favor with contemporary audiences, verbose markup is becoming less popular.  Anytime people can choose between a simple way or a complex one to do the same thing, they choose the simple one.  Simple, plain, direct. They don’t want elaborate expressiveness all the time, only when they need it.  

When people talk about content as being intelligent (understandable to other machines), they may mean different things.  Does the machine need to be able to understand everything about all the content from another source, or does it only need to have a short conversation with the content?  XML is based on the idea that different machines share a common schema or basis of understanding. It has a rigid formal grammar that must be adhered to. APIs are less worried about each machine understanding everything about the content coming from everywhere else.  A machine using an API only cares about understanding (accessing and using) the content it is interested in (a query). That allows for more informal communication. By being less insistent on speaking an identical formal language, APIs enable content to be exchanged more easily and used more widely.  As a result, content defined by APIs is more ubiquitous: able to move quickly to where it’s needed.
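The “short conversation” idea can be sketched in a few lines.  In this hypothetical example the consumer takes only the fields it cares about from a JSON response and ignores the rest, without needing to know the full structure of the source.  The response and its field names are invented for illustration.

```python
import json

# Hypothetical API response; the field names are invented for illustration.
api_response = json.dumps({
    "id": 42,
    "headline": "Store hours updated",
    "summary": "We now open at nine.",
    "internal_notes": "not needed by this consumer",
    "layout": {"columns": 2},
})


def pick_fields(raw_json, wanted):
    """Return only the requested fields, ignoring everything else in the response."""
    record = json.loads(raw_json)
    return {key: record[key] for key in wanted if key in record}


# The consumer asks only for what it needs to build a summary card.
card = pick_fields(api_response, ["headline", "summary"])
print(card)
```

The consumer keeps working if the source later adds or rearranges other fields, which is one sense in which this style of exchange is loosely coupled.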

Ultimately, XML and APIs embrace different philosophies about content.  XML provides a monolithic description of a huge block of content.  It’s concerned with strictly controlling a mass of content and involves a tightly coupled chain of dependencies, all of which must be satisfied for the process to work smoothly.  APIs, in contrast, are about connecting fragments of content.  It’s a decentralized, loosely coupled, bottom-up approach.  (The management of content delivered by APIs is handled by headless content models, but that’s another topic.)

Broadly speaking, APIs treat the parts as more important than the whole.  XML treats the whole as more important than the parts.  

Our growing reliance on the cloud has made it increasingly important to connect content quickly.  That imperative has made content more open.  And openness depends on outsiders being able to understand what the content is and use it quickly.  

As XML has declined in popularity, one of its core ideas has been challenged.  The presumption has been that the more markup in the content, the better.  XML allows for many layerings of markup, which can specify what different parts of text concern.  The belief was that this was good: it made the text “smarter” and easier for machines to parse and understand.  In practice, this vision hasn’t happened.  XML-defined text could be subject to so many parenthetical qualifications that it was like trying to parse some arcane legalese.  Only the author understood what was meant and how to interpret it.  The “smarter” the XML document tried to be, the more illegible it became to people who had to work with the document — other authors or developers who would do something later with the content.    Compared with the straightforward language of key-value pairs and declarative API requests, XML documentation became an advertisement pointing out how difficult its markup is to use.  “The limitations in JSON actually end up being one of its biggest benefits. A common line of thought among developers is that XML comes out on top because it supports modeling more objects. However, JSON’s limitations simplify the code, add predictability and increase readability.”  Too much expressiveness becomes an encumbrance.  

Like any monolithic approach, XML has become burdened by details as it has sought to address all contingencies.  As XML ages, it suffers from technical debt.  The specifications have grown, but don’t necessarily offer more.  XML’s situation today is similar to Latin’s situation in the 18th century, when scientists were trying to use it to communicate scientific concepts.  One commenter asserts that XML suffers from worsening usability: “XML is no longer simple. It now consists of a growing collection of complex connected and disconnected specifications. As a result, usability has suffered. This is because it takes longer to develop XML tools. These users are now rooting for something simpler.”  Simpler things are faster, and speed matters mightily in the connected cloud.  What’s relevant depends on providing small details right when they are needed.

At a high level, digital content is bifurcating between API-first approaches and those that don’t rely on APIs.  An API-first approach is the right choice when content is fast-moving.  And nearly all forms of content need to speed up and become more agile.  Content operations are struggling to keep up with diversifying channels and audience segmentation, as well as the challenges of keeping the growing volumes of online content up-to-date.  While APIs aren’t new anymore, their role in leading how content is organized and delivered is still in its early stages.  Very few online publishers are truly API-first in their orientation, though the momentum of this approach is building.

When content isn’t fast-moving, APIs are less important. XML is sometimes the better choice for slow-moving content, especially if the entire corpus is tightly constructed as a single complex entity.  Examples are legal and legislative documents or standards specifications. XML will still be important in defining the slow-moving foundations of certain core web standards or ontologies like OWL — areas that most web publishers will never need to touch.  XML is best suited for content that’s meant to be an unchanging record.  

Within web content, XML won’t be used as a universal language defining all content, since most online content changes often.  For those of us who don’t have to use XML as our main approach, how is it relevant?  I expect XML will play niche roles on the web.  XML will need to adapt to the fast-paced world of APIs, even if reluctantly.  To be able to function more agilely, it will be used in a selective way to define fragments of content.

An example of fragmental XML is how Google uses the SSML standard, an XML-defined speech markup standard, to indicate speech emphasis and pronunciation.  This standard predates the emergence of consumer voice interfaces, such as “Hey Google!” Because it was in place already, Google has incorporated it within the JSON-defined schema.org semantic metadata they use.  The XML markup, with its angle brackets, is inserted within the quote marks and curly brackets of JSON.  JSON describes the content overall, while XML provides assistance to indicate how to say words aloud.
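The nesting described above can be illustrated with a rough sketch.  The JSON field names here are invented and are not Google's actual schema; only the speak and emphasis elements come from the SSML standard.  The example shows a JSON record carrying an SSML fragment as a string, which a consumer parses as XML only when it needs the speech hints.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payload: the JSON field names are invented for illustration
# and are not Google's actual schema. <speak> and <emphasis> are SSML elements.
payload = json.dumps({
    "name": "Store hours",
    "ssml": '<speak>We open at <emphasis level="strong">nine</emphasis> sharp.</speak>',
})


def spoken_text(raw_json):
    """Read the JSON wrapper, then parse the embedded XML and return its plain text."""
    record = json.loads(raw_json)
    speak = ET.fromstring(record["ssml"])
    return "".join(speak.itertext())


def emphasized_words(raw_json):
    """Return the text of every <emphasis> element inside the embedded SSML."""
    record = json.loads(raw_json)
    speak = ET.fromstring(record["ssml"])
    return [element.text for element in speak.iter("emphasis")]


print(spoken_text(payload))
print(emphasized_words(payload))
```

JSON does the structural work of describing the record, and the XML fragment stays small and self-contained inside one string value.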

SVG, used to define vector graphics, is another example of fragmental XML.  SVG image files are embedded in or linked to HTML files without needing to have the rest of the content be in XML.
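A small sketch can show what “without needing to have the rest of the content be in XML” means in practice.  The page and the graphic below are invented for illustration.  The SVG fragment parses as well-formed XML by itself, while the HTML page that hosts it does not, because HTML permits unclosed tags such as a bare line break.

```python
import xml.etree.ElementTree as ET

# Hypothetical graphic and page, invented for illustration.
svg_fragment = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
    '<circle cx="5" cy="5" r="4"/></svg>'
)
# The surrounding HTML uses unclosed <p> and <br> tags, which HTML allows but XML does not.
html_page = "<!DOCTYPE html><html><body><p>A circle:<br>" + svg_fragment + "</body></html>"


def is_well_formed_xml(text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed_xml(svg_fragment))
print(is_well_formed_xml(html_page))
```

The XML rules apply only inside the fragment, so the host page is free to follow looser HTML conventions.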

More generally, XML will exist on the web as self-contained files or as snippets of code.  We’ll see less use of XML to define the corpus of text as a whole.  The stylistic paradigm of XML, of using in-line markup (comments within a sentence), is losing its appeal, as it is hard for both humans and machines to read and parse. An irony is that while XML has made its reputation for managing text, it is not especially good at managing individual words.  Swapping words out within a sentence is not something that any traditional programming approach does elegantly, whether XML-based or not, because natural language is more complex than an IT language processor can handle.  What’s been a unique advantage of XML, defining the function of words within a sentence, is starting to be less important.  Deep learning techniques (e.g., GPT-3) can parse wording at an even more granular level than XML markup, without the overhead.  Natural language generation can construct natural-sounding text.  Over time, the value of in-line markup for speech, such as that used in SSML, will diminish as natural language generation improves its ability to present prosody in speech.  While deep learning can manage micro-level aspects of words and sentences, it is far from being able to manage structural and relational dimensions of content.  Different approaches to content management, whether utilizing XML or APIs connected to headless content models, will still be important.

As happened with Latin, XML is evolving away from being a universal language.  It is becoming a controlled vocabulary used to define highly specialized content objects.  And much like Latin gave us the alphabet upon which many languages are built, XML has contributed many concepts to content management that other languages will draw upon for years to come.  XML may be becoming more of a niche, but it’s a niche with an outsized influence.

— Michael Andrews