Structured documents for science: JATS XML as canonical content format
It’s only my 7th day on the job here at PLOS as a product manager for content management. So it’s early days, but I’m starting to think about the role of JATS XML in the journal publishing process.
I come from the book-publishing world, so my immediate challenge is to get up to speed on journal publishing. And that includes learning the NISO standard JATS (Journal Archiving and Interchange Tag Suite). You may know JATS by the name of its predecessor, the NLM DTD. As journal publishing folks know, JATS is used for delivering metadata, and sometimes full text, to the various journal archives.
But here’s where journal and book publishing share the same dilemma: just because XML is a critically important exchange format, is it the best authoring format these days? Should it be the canonical storage format for full text content? And how far upstream should XML be incorporated into the workflow?
Let’s look at books for a minute. The book-publishing world has standardized on an electronic delivery format of EPUB (and its cousin, MOBI). This standardization has helped publishers drill down to a shorter list of viable options for canonical source format. Even if most publishers haven’t yet jumped to adopt end-to-end HTML workflows, it’s clear to me that HTML makes a lot of sense for book publishing. Forward-thinking book publishers like O’Reilly are starting to replace their XML workflow with an HTML5/CSS3 workflow. HTML/CSS can provide a great authoring and editing experience, and then it also gets you to print and electronic delivery with a minimum of processing, handling, or conversion. (O’Reilly’s Nellie McKesson gave a presentation about this at TOC 2013.) And which technology will get the most traction and advance the most in the next few years, XML or HTML? I know which one I’m betting on.
In terms of canonical file format, journal publishing may have one less worry than book publishing, because many journals are moving away from print to focus exclusively on electronic delivery whereas most books still have a print component. Electronic journal reading—or at least article discovery—happens in a browser; therefore, HTML is the de facto principal delivery format. And as much as I’d like to think HTML is the only format that matters, I know that many readers still like to download and read articles in PDF format. But as I mentioned, spinning off attractive, readable PDF from HTML is pretty easy to automate these days. So I ask:
If XML is being used as an interchange format only, what do we gain from moving the XML piece of the workflow any further upstream from final delivery?
Well, why does anyone adopt an XML workflow? The key benefits are: platform/software independence (which HTML also provides), managing and remixing content to the node level (which is not terribly useful for journal articles), and transforming the content to a number of different output formats such as PDF, HTML, and XML (HTML5/CSS3 can be used for this transformation as well, with a bit of toolchain development work).
But XML workflows come with a hefty price tag. The obvious cost is conversion, which is expensive in both money and time. Another downside is the learning curve for the people actually interacting with the XML. How many people should that be? In the real world, will you ever get authors, editors, and reviewers to agree to interact with their content as XML? More likely than not, you’re either going to hide the underlying XML behind a WYSIWYG-ish editor that you buy or build (both are expensive), or you’re going to do your XML conversion towards the end of the process. On a similar note, how easy is it to hire experienced XSL-FO toolchain developers? By contrast, developers who work in the world of HTML5, CSS3, and JavaScript are plentiful.
So building an entire content management system and workflow for journal publishing around XML (specifically JATS XML, which is just one delivery format and isn’t needed until basically the end of the process) doesn’t seem like a slam-dunk to me. I should clarify that using JATS XML for defining metadata does seem like the obvious way to go. But I’m not so sure it’s a good fit to serve as the canonical storage format for the full text. One idea is to separate article metadata from the article body text, to leverage the ease-of-editing of HTML for the text itself.
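To make the idea of separating metadata from body text a little more concrete, here is a minimal Python sketch of what a late-stage JATS export might look like. It is an illustration only, not anything PLOS has built: the record layout, the placeholder DOI, the `TAG_MAP` table, and the `build_jats` and `convert` helpers are all my own assumptions. The body is assumed to be stored as well-formed XHTML, and the output uses real JATS element names but is not validated against the JATS DTD (a real export would need journal metadata and much more).

```python
import xml.etree.ElementTree as ET

# Hypothetical article record: metadata kept as structured data,
# body kept as an XHTML fragment that authors and editors work on directly.
metadata = {
    "doi": "10.0000/example.0001",  # placeholder, not a real DOI
    "title": "An example article",
    "authors": [{"surname": "Doe", "given_names": "Jane"}],
}
body_html = (
    "<div><p>First paragraph with <em>emphasis</em>.</p>"
    "<p>Second paragraph.</p></div>"
)

# Small illustrative mapping from HTML tags to JATS elements (far from complete).
TAG_MAP = {
    "p": "p",
    "em": "italic",
    "i": "italic",
    "strong": "bold",
    "b": "bold",
    "sub": "sub",
    "sup": "sup",
}


def convert(node, parent):
    """Copy one XHTML node into the JATS tree under parent, renaming its tag."""
    tag = TAG_MAP.get(node.tag)
    if tag is None:
        # Fail loudly instead of silently dropping content we cannot map.
        raise ValueError(f"no JATS mapping for <{node.tag}>")
    out = ET.SubElement(parent, tag)
    out.text = node.text
    out.tail = node.tail
    for child in node:
        convert(child, out)


def build_jats(meta, html):
    """Assemble a skeletal JATS article from separate metadata and XHTML body."""
    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    article_meta = ET.SubElement(front, "article-meta")
    doi = ET.SubElement(article_meta, "article-id", {"pub-id-type": "doi"})
    doi.text = meta["doi"]
    title_group = ET.SubElement(article_meta, "title-group")
    ET.SubElement(title_group, "article-title").text = meta["title"]
    contrib_group = ET.SubElement(article_meta, "contrib-group")
    for author in meta["authors"]:
        contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = author["surname"]
        ET.SubElement(name, "given-names").text = author["given_names"]
    body = ET.SubElement(article, "body")
    # The outer <div> is only a wrapper; its children become the JATS body.
    for child in ET.fromstring(html):
        convert(child, body)
    return article


jats_xml = ET.tostring(build_jats(metadata, body_html), encoding="unicode")
print(jats_xml)
```

The point of the sketch is the shape of the workflow: everyone upstream edits HTML and a metadata record, and the XML only appears when an archive asks for it. The hard part in practice would be the tag mapping for figures, tables, math, and references, which this example does not attempt.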
What about moving HTML upstream, and focusing efforts on delivering better, more readable HTML in the browser? What about shifting focus away from old print models and toward leveraging modern browser functionality, maybe by adding inline video or interactive models, or by making math, figures, and tables easier to read and work with?
Just to throw a curve ball into the discussion, I attended Markdown for Science last weekend, where Martin Fenner and Stian Håklev led the conversation about whether it makes sense to use markdown plus Git for academic authoring and collaboration. I want to hear from as many sides of the content format conversation as possible.
So, what do YOU think?
This article was written by contributor Molly Sharp, appeared earlier on the PLOS site, and is presented here with permission of the author. Molly has worked in various content management-related roles since the late ’90s, when she led the implementation of an XML editing and production system for Sybex, a tech book publisher. Most recently, Molly was the Director of Content Management at Safari Books Online, an electronic reference library of 30,000 tech & business titles, where she created and managed a Content Team to ensure the quality of incoming content; designed and maintained content-related processes and workflows; and managed a publishing partner community of more than 100 organizations.