Here’s how indexing could evolve with ebooks
Last month I shared some thoughts about how indexes seem to be a thing of the past, at least when it comes to ebooks. I’ve given more consideration to the topic and would like to offer a possible vision for the future.
Long ago I learned the value an exceptional indexer can bring to a project. For example, there’s a huge difference between simply capturing all the keywords in a book and producing an index that’s richly filled with synonyms, cross-references and related topics. And while we may never be able to completely duplicate the human element in a computer-generated index, I’d like to think value can be added via automated text analysis, algorithms and all the resulting tags.
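To make that synonym gap concrete, here’s a minimal sketch of how an automated pass might expand a reader’s term through a domain synonym ring before matching, the kind of enrichment a human indexer provides by hand. Everything here (the synonym sets, the `match` helper) is an invented illustration, not anything from an actual reading system:

```python
# Sketch: expanding a query term with a domain synonym ring before matching.
# The synonym sets below are invented examples, not a real vocabulary.
SYNONYM_RINGS = [
    {"ebook", "e-book", "digital book"},
    {"index", "back-of-book index"},
]

def expand_terms(term):
    """Return the term plus any synonyms from the rings it belongs to."""
    expanded = {term}
    for ring in SYNONYM_RINGS:
        if term in ring:
            expanded |= ring
    return expanded

def match(passage, term):
    """True if the passage contains the term or any of its synonyms."""
    text = passage.lower()
    return any(t in text for t in expand_terms(term))

print(match("A digital book needs navigation.", "ebook"))  # True via synonym
```

A keyword-only pass would miss this passage entirely; the ring is what a skilled indexer encodes implicitly.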
Perhaps it’s time to think differently about indexes in ebooks. As I mentioned in that earlier article, I’m focused exclusively on non-fiction here. Rather than a static compilation of entries in the book I’m currently reading, I want something that’s more akin to a dynamic Google search.
Let me tap a phrase on my screen and, by all means, show me the other occurrences of that phrase in this book, but let’s also make sure those results can be sorted by relevance, not just the chronological order in which they appear. Why should the results be limited to the book I’m reading, though? Maybe that author or publisher has a few other titles on that topic or closely related ones. Those references and excerpts should be accessible via this pop-up e-index as well. If I own those books, I can jump directly to the pages within them; if not, these entries serve as a discovery and marketing vehicle, encouraging me to purchase the other titles.
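A pop-up e-index along these lines could be sketched as a small search routine: collect every passage containing the tapped phrase across the reader’s library, then sort by occurrence count instead of page order. The library contents and the frequency-based score below are hypothetical placeholders; a real system would use a proper relevance model:

```python
# Sketch of a pop-up "e-index": find occurrences of a tapped phrase across
# a library and rank them by occurrence count rather than page order.
# Book titles and passages are invented for illustration.

def find_occurrences(phrase, library):
    """library: {title: [passage, ...]}. Return (title, passage_idx, count)
    hits sorted most-relevant first."""
    phrase = phrase.lower()
    hits = []
    for title, passages in library.items():
        for idx, passage in enumerate(passages):
            count = passage.lower().count(phrase)
            if count:
                hits.append((title, idx, count))
    # Sort by relevance (occurrence count), not by position in the book.
    return sorted(hits, key=lambda h: h[2], reverse=True)

library = {
    "Owned Title": ["Metadata matters.", "Metadata, metadata everywhere."],
    "Other Title": ["A note on metadata."],
}
print(find_occurrences("metadata", library))
```

Hits from unowned titles would simply carry a store link instead of a jump-to-page action.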
This approach lends itself to an automated process. Once the logic is established, a high-speed parsing tool would analyze the content and create the initial entries across all books. The tool would be built into the ebook reader application, tracking the phrases that are most commonly searched for and perhaps refining the results over time based on which entries get the most click-throughs. Sounds a lot like one of the basic attributes of web search results, right?
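That click-through feedback loop might look something like this in miniature. The entry ids and base scores are invented, and real systems blend many more signals, but the sketch shows how observed clicks can nudge the ranking over time:

```python
# Sketch: re-rank e-index results by blending a base relevance score with
# observed click-throughs, the web-search-style feedback loop described above.
from collections import Counter

clicks = Counter()  # entry id -> number of times readers clicked it

def record_click(entry_id):
    clicks[entry_id] += 1

def rerank(results):
    """results: list of (entry_id, base_score). Favor clicked entries."""
    return sorted(results, key=lambda r: r[1] + clicks[r[0]], reverse=True)

results = [("ch2-metadata", 3.0), ("ch7-metadata", 2.0)]
record_click("ch7-metadata")
record_click("ch7-metadata")
print(rerank(results))  # "ch7-metadata" now outranks "ch2-metadata"
```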
Note that this could all be done without a traditional index. However, I also see where a human-generated index could serve as an additional input, providing an even richer experience.
How about leveraging the collective wisdom of the community as well? Provide a basic e-index as a foundation but let anyone contribute their own thoughts and additions to it. Don’t force the crowdsourced results on all readers. Rather, let each consumer decide which other members of the community add the most value and filter out all the others.
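Letting each reader decide whose contributions count could be as simple as filtering entries by a trusted-contributor set layered over the publisher’s baseline. The contributor names and entries below are made up for illustration:

```python
# Sketch: a crowdsourced e-index where each reader chooses which
# contributors' entries to see, on top of a publisher-provided baseline.

def visible_entries(entries, trusted):
    """entries: list of (contributor, term, location). Keep the baseline
    entries plus those from contributors this reader trusts."""
    return [e for e in entries if e[0] == "baseline" or e[0] in trusted]

entries = [
    ("baseline", "metadata", "ch. 2"),
    ("reader_a", "metadata, history of", "ch. 2"),
    ("reader_b", "metadata", "ch. 9"),
]
print(visible_entries(entries, trusted={"reader_a"}))
```

Untrusting a contributor is just removing them from the set; nothing is deleted from the shared pool.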
This gets back to a point I’ve made a number of times before. We’re stuck consuming dumb content on smart devices. As long as we keep looking at ebooks through a print book lens we’ll never fully experience all the potential a digital book has to offer.
Nice! Yes, the thing that a human-generated "index" brings to the mix is term relevance and the introduction of synonyms. Synonyms are the big piece missing from most searches. This can be provided by including a domain-specific set of synonyms in the search index, but if you've already got a "real" index, that might be all that's needed. Finding a simple way to provide an index-augmented full-text search would be ideal, IMHO.
Posted by: Saprentice | April 18, 2016 at 01:39 PM
This comes closer to a solution for "indexing" nonfiction e-books than anything I've encountered, and I think that's largely because of your underlying premise that we must stop "looking at ebooks through a print book lens." (Digital publishing holds such potential that most book publishers completely miss because it doesn't follow the printed-page "rules.") I worry about one major obstacle to accomplishing your suggestion. Perhaps I'm mis-comprehending what you mean in this proposition: "Once the logic is established, a high-speed parsing tool would analyze the content and create the initial entries across all books," but such a tool would be called on to analyze text on thousands of disciplines about which people write nonfiction. Every field, not to mention every "school" within it, has its own "language." Adepts understand them, but most initiates have to come to grips with a new vocabulary and ways of thinking before such texts become meaningful. Could a logic be established that first determines which foundational ideas and terms fit a given discipline—including "smart" cross-referencing to accommodate schools and offshoots—then adjusts its parsing tool to accommodate that field along with myriad others? If so, what a boon that would be to readers and researchers! (And what a hard sell to traditional publishers…) I can tell you that, once you or someone else develops such algorithms, I'll be first in line as a beta tester! Thanks for the post.
Posted by: Brian Hotchkiss | April 19, 2016 at 11:08 AM
You make an excellent point regarding the complexity of this task, Brian. I'm assuming this parsing tool would have some inputs (e.g., known keywords and perhaps synonyms for those keywords) as well as the ability to get smarter as it processes more text. It's easy for me to say and hard for someone to develop, I know, but I'd like to think it's possible.
Posted by: Joe Wikert | April 19, 2016 at 01:23 PM
Indeed, we in the indexing world have looked at the results of concordance tools for quite a while, and like search engines, when you are dealing with specific terminology, the concordance engines must be fine-tuned to get results that come close to index precision. It usually takes 4-5 times longer to tune the engine than it does to index the content, and the tuning's output has to be heavily cleaned to get out the garbage. But if you leverage the work the indexer has already done, combine it with synonym rings and other such content-rich tools, you could get what Scott refers to above, an "index-augmented" view of content.
The EPUB 3.0 spec for indexes has a lot of features built into it to support markup and display in nearly every context imaginable, as long as the hardware and operating system of the reading device have been programmed to put it to use. We are limited by the hardware manufacturers and the rate at which they incorporate the 3.0 spec. Otherwise, we could already be marking up a web of connections beneath the content, ready to be displayed to the user in lots of ingenious ways. Charts, links, satellites, ranges, connections: we could do a lot. But we need the reader manufacturers and operating systems to start supporting markup, and to make it desirable to do. Or we won't see rich content navigation.
The readers I see these days still have not developed much, and seem stuck in old metaphors.
Posted by: Jan Wright | April 19, 2016 at 06:05 PM
Yikes! This is many times more complex than I imagined. What worries me most, however, is that readers, especially younger/newer readers, will have no sense of the value of what we call an index, and demand for and use of such tools will fall away through lack of interest or ignorance of their value. (Of course, maybe by that time, such a fate may be warranted…) It reminds me of occasions when people as old as their early 30s stop me to ask what time it is. When I've shown them my (analog) watch, they apologize and tell me they don't know how to "interpret that." (It's true, that's the phrase most have used!)
Posted by: Brian Hotchkiss | April 24, 2016 at 11:03 AM