Project Bamboo wiki: Corpora Camp Lessons Learned Report

This was originally published on the Project Bamboo wiki at https://wiki.projectbamboo.org/display/BTECH/Corpora+Camp+Lessons+Learne... (https://wiki.projectbamboo.org/x/-AZYAQ) by Seth Denbo, last modified by Travis Brown on 17 May 2011.

I. Introduction
II. Overview
III. Collections
III.1 Common representation and data store
III.2 Hathi texts
III.3 Other collections
IV. Architecture
IV.1 Cloud Computing
IV.2 Collection Interoperability
IV.3 Support for Wide Range of Interfaces
V. Application
V.1 Analysis
V.2 Visualization design
V.3 Visualization software
VI. Recommendations
VI.1 Methods for text exploration
VI.2 Collections Interoperability
VI.3 Create a Path to an Expert-level Interface
VII. Conclusion

I. Introduction

From March 2 to 4, 2011, the Bamboo Technology Project Corpora Space work group held CorporaCamp, the first of three planned workshops, at the University of Maryland. These events are part of the design phase of Corpora Space, which began in January 2011 and continues for fifteen months. The primary outcome of this phase is a road map for the subsequent eighteen-month implementation of Corpora Space, so all of our activities are focused on informing this document.

During CorporaCamp the participants designed and developed a tool for the exploration of texts from distributed, large-scale collections. The primary purpose of this exercise in tool building was to gain a greater understanding of the challenges involved in building Corpora Space infrastructure. This report will address the lessons learned in the three main areas on which CorporaCamp participants worked: the development of a platform architecture for Corpora Space, a prototype application, and interoperability among the collections to facilitate research.

The overview section below lists the most significant lessons learned, presenting them in three groups: the trade-offs we faced, the challenges we identified, and the successes we had. The rest of this report presents a more detailed account of our process and the lessons learned.

II. Overview

  1. Trade-offs: CorporaCamp forced us to identify and negotiate several trade-offs involved in implementing a piece of functionality across diverse collections, including most importantly the following:
    1. Data representations: The collections we were working with had very different degrees of structure and annotation, and we had to decide whether to enrich the less structured texts or flatten the more structured ones. Given our limited resources we adopted the latter approach in most cases.
    2. Architecture: We had to balance the advantages of a distributed architecture where independent agents interact through a switchboard against those of more traditional web application frameworks, where components are tightly coupled. We developed our first working prototype using the second approach, but also built a set of distributed components (using the platform API that we've code-named Utukku) that perform many pieces of the functionality.
    3. Legacy systems and emerging technologies: Here we made a deliberate choice to be forward-looking in our adoption of standards and technologies — for example by adopting HTML5 and the WebSocket protocol in the distributed version of the application. While this approach offers many advantages, it also shuts out users whose browsers do not yet support these technologies.
    4. External software libraries: At several points in the development process we had to choose between general libraries (or tools) that would allow our core functionality to be more easily extended in the future, and more limited tools that solved our immediate goals with less work. In some cases — the visualization code, for example — we took the latter approach for the prototype but have partial implementations using more general tools.
    5. User interfaces and visualization: In our initial design plans the visualizations that we intended to present to the user were very simple. Once we had a working prototype, we realized that we needed to add elements to the interface in order to allow users to navigate the data in a useful manner. This additional complexity requires more user engagement and training.
  2. Challenges: For most of these trade-offs we were able to balance the opposed concerns satisfactorily, but the issue of creating interoperable representations of data from diverse sources in particular posed problems that we were not able to address in the scope of the workshop:
    1. Metadata: Deciding on a schema for metadata that works across collections is complex, even for a relatively simple application.
    2. Provenance and versioning: We need to be able to record and provide access to information about changes to objects in a collection: when each change occurred, who made it (along with information about that person), and what exactly was changed.
  3. Successes: In some cases we believe that the decisions we made proved particularly successful:
    1. Leveraging existing tools and services: We demonstrated that it is possible to piece together a diverse set of resources quickly and effectively. These resources included the following:
      1. Cloud computing: Using Amazon's Elastic Compute Cloud (EC2) service we were able to provide each team at the workshop with uniform development servers that could be scaled as necessary.
      2. Data store: We were able to use ElasticSearch, a Lucene-based search engine that provides a REST interface, as a lightweight and flexible data store.
      3. Analysis: UMass's MALLET toolkit and the Colt library — developed by CERN for "High Performance Scientific and Technical Computing in Java" — allowed us to perform complex analysis of our documents efficiently.
    2. Interfacing with collections: Bamboo doesn't have to control collections in order to make gains — our use of ElasticSearch's REST API demonstrates that we can easily interface with external collections.
    3. Extensibility: We demonstrated that our distributed approach makes it relatively easy for users to build applications — for example a JavaScript-based web search interface — that work with our architecture without requiring the user to have access to servers or other infrastructure of their own.

The rapid development process of the workshop required us constantly to balance our long-term goals — experimenting with a distributed, extensible architecture — against our desire to have a working prototype implemented at the end of the three days. In many cases we had two development threads running in parallel, with one group working on a more general solution and another on a simpler fallback. We believe that this process provided us with a better sense of the problems and decisions — and the range of consequences of those decisions — that we'll be faced with in developing future Corpora Space applications. In particular we found that we had underestimated the difficulty of collections interoperability in our preparation for the workshop, while we were much more successful in our use of a diverse set of platforms, tools, and libraries.

The following sections of this document discuss the issues outlined above in more detail as they relate to the collections, architecture, and the functionality of the application.

III. Collections

III.1 Common representation and data store

The functionality that we had decided to implement at the workshop was designed to operate on very simple representations of the texts from our three collections. Because these collections used two very different formats — TEI-A in the case of TCP and Perseus, and a simple page-based plain-text format for Hathi — we had a choice between adding structure to the Hathi texts, to bring them closer to TCP and Perseus, and removing structure from the TCP and Perseus texts. A series of experiments we ran on the Hathi texts while preparing for the workshop suggested that the latter would be the more practical approach. During the first sessions of the workshop we decided to use a JSON format that would include some basic metadata about each document as well as two simple representations of its content: a plain-text version for use in analysis, and an HTML version for display in the drill-down view.
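As an illustration, a document in this common JSON format might look like the following sketch. The field names here are hypothetical, since the report does not record the exact schema:

```python
import json

# A hypothetical example of the common per-document JSON format: basic
# metadata plus a plain-text representation for analysis and an HTML
# representation for the drill-down view. Field names are illustrative,
# not the actual schema used at the workshop.
document = {
    "id": "tcp-example-p147",
    "collection": "TCP",
    "title": "Travels into Several Remote Nations of the World",
    "date": "1726",
    "text": "I stayed but two months with my wife and family ...",
    "html": "<p>I stayed but two months with my wife and family ...</p>",
}

serialized = json.dumps(document)
print(serialized[:40])
```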

We decided to use ElasticSearch, a Lucene-based full-text search engine, to store and query our texts. ElasticSearch is properly an index, not a data store, but since the functionality we were implementing did not require writing to the collection, we decided that querying ElasticSearch through its REST API would be an appropriate approximation of the way that Corpora Space applications might interact with external collections in the future.
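For example, a simple full-text query can be expressed in ElasticSearch's JSON query DSL and sent over HTTP; the index name and endpoint below are assumptions:

```python
import json
import urllib.request

# Build a simple full-text query using ElasticSearch's JSON query DSL.
# The index name "corpora" and the localhost endpoint are hypothetical.
query = {
    "query": {"match": {"text": "gospel"}},
    "size": 10,
}
body = json.dumps(query).encode("utf-8")

# Sending the request (commented out, since it needs a running server):
# req = urllib.request.Request(
#     "http://localhost:9200/corpora/_search",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     hits = json.loads(resp.read())["hits"]["hits"]
print(body.decode()[:30])
```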

III.2 Hathi texts

The Hathi Trust provided us with a bulk transfer of approximately 120,000 public-domain texts that had been digitized by parties other than Google, and therefore have no restrictions on use or redistribution. Because of the focus of our other collections we decided to work primarily with a subset of the Hathi texts published before 1837. The publication date field in the metadata was not standardized, and we decided to err on the side of excluding texts when the field couldn't be easily parsed. We also filtered out a small set of documents that were marked as being in the public domain only in the United States. This selection process left us with approximately 10,000 texts, many of which were duplicates, with the same edition having been digitized at multiple institutions. We did not have the resources at the workshop to develop a consistent method for filtering duplicates given the available metadata.
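The conservative date filter described above might be sketched as follows; the exact parsing rules used at the workshop are not recorded here, so the four-digit-year heuristic is an assumption:

```python
import re

# A sketch of the conservative date filter: extract a four-digit year
# from the free-form publication-date field and keep a record only when
# a year can be recovered and falls before the cutoff. Records whose
# dates cannot be parsed are excluded, erring on the side of exclusion.
def keep_record(date_field, cutoff=1837):
    match = re.search(r"\b(1[0-9]{3})\b", date_field or "")
    if match is None:
        return False  # unparseable date: exclude
    return int(match.group(1)) < cutoff

print(keep_record("London : printed 1726"))  # parseable and early enough
print(keep_record("18uu"))                   # unparseable: excluded
print(keep_record("1850"))                   # too late: excluded
```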

The quality of the OCR for these texts posed a more serious challenge. The following example from an early edition of Gulliver's Travels is representative of texts published before 1800:

wLILLIPUT. 147 I ftayed but two Months with my Wife and Family 5 for my infatiable De- fire of feeing foreign Countries would fuffer me to continue no longer. I left fifteen hundred Pounds with my Wife, and fixed her in a good Houfe at Red- riff. My remaining Stock I carried with me, part in Money, and part in Goods, in hopes to improve my For- tunes. My eldeft Uncle John had left me an Effate in Land, near Epping, of about thirty Pounds a Years and I had along Leafe of the Black-Bull in Fet- ter-Lane, which yielded me as much more: fo that-1 was-not in any danger of leaving my Family upon the Parifh. My Son Johnny, named fo after his Uncle, was at the Grammar School, and atowardly Child. My Daughter Betty (who is now well married, and has Chil- dren) was then at her Needle-Work. I took leave of my-Wife, and Boy and Girl, with tears on both fides, and>went on

Many examples were substantially worse, with a word error rate of 70-80% not being unusual in our initial investigations. The texts also had no consistent layout analysis: only page breaks were captured reliably, and the variation in the OCR output made it difficult to recover paragraph breaks in an automated fashion.
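The word error rates cited above can be made precise: WER is the word-level edit distance between the OCR output and a reference transcription, divided by the reference length. A minimal sketch in Python:

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(hyp)][len(ref)] / len(ref)

# A long-s OCR error ("ftayed" for "stayed") counts as one substitution
# out of five reference words:
print(word_error_rate("I ftayed but two Months", "I stayed but two Months"))
```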

These problems put constraints on the kind of analysis that we could usefully perform on the data. We had initially decided to offer multiple feature representations of the texts, including something like n-grams as stylistic features, but the extremely high word error rates made this approach less useful. In our initial experiments we found that using Latent Dirichlet Allocation for topic modeling provided an alternative way of characterizing documents usefully despite the many errors.

The lack of consistent structure in the Hathi texts also limited our options for selecting appropriate document units for analysis. While it might have been possible to use Hathi's OCR coordinate data to identify paragraphs by indentation or page layout, this kind of analysis was beyond the scope of the workshop, and we decided to use pages as our documents for the initial prototype. While this approach is not ideal, since pages generally do not correspond to organic divisions in the text, it seemed to produce interpretable output in our initial experiments.

III.3 Other collections

For the TCP and Perseus collections the primary challenge was dividing the texts into units that would be comparable to the page divisions of the Hathi texts. We also needed to create HTML representations of the individual documents for presentation in the drill-down view, as well as a way to create links back to the original collections.

IV. Architecture

IV.1 Cloud Computing

In preparation for CorporaCamp we built an Amazon EC2 (Elastic Compute Cloud) virtual machine image to facilitate development by allowing the workshop participants to have access to uniform development machines with essential libraries and applications installed and configured in advance. We took as our starting point the official Ubuntu 10.04 LTS image distributed by Canonical and installed and configured Apache Tomcat, ElasticSearch, Git, the MALLET machine learning toolkit, and a number of other libraries and applications.

On the first day of the workshop we started three instances of this machine image: two low-powered instances for specific development groups and one large instance with 7.5 GB of memory and two virtual cores for more computationally intensive processing tasks. This approach allowed us to coordinate development work and share data effectively. After the conclusion of the workshop we turned off all three of these instances and started two new small instances: one to host the prototype we had completed and the other to support ongoing development on the Utukku-based version.

IV.2 Collection Interoperability

It is easy to assume that collection interoperability is a universal advantage and that accessing all collections through the same interface is a step forward, but interoperability has costs, and there are situations in which the distinctive properties of a collection need to be preserved. Some collections are built around a specific set of research questions, and not all humanists will want all collections to have the same interface, because their own questions may depend on particular properties of those collections. While this may not be as much of an issue with the large collections we looked at (e.g., Hathi Trust, ECCO, EEBO), it will be important when we try to bring in smaller curated collections.

The CorporaCamp platform presents functionality such as an interface into a collection as an XML namespace with a collection of functions. The platform can support multiple collection profiles by assigning a different namespace to each profile. The collections can advertise which profiles they support by having the agent export the corresponding namespaces to the ecosystem.
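As a generic sketch (not the actual Utukku API, whose details are not recorded in this report), an agent might advertise the profiles it supports by exporting one namespace of functions per profile:

```python
# A generic illustration of profile advertisement: each profile is a
# namespace mapping function names to implementations, and an agent
# declares support for a profile by exporting that namespace. The class
# and namespace URIs are hypothetical, not part of Utukku.
class CollectionAgent:
    def __init__(self, name):
        self.name = name
        self.namespaces = {}  # namespace URI -> {function name: callable}

    def export(self, namespace, functions):
        """Advertise support for a profile by exporting its namespace."""
        self.namespaces[namespace] = functions

    def supports(self, namespace):
        return namespace in self.namespaces

agent = CollectionAgent("hathi")
agent.export("http://example.org/profiles/full-text-search",
             {"search": lambda query: []})
print(agent.supports("http://example.org/profiles/full-text-search"))
```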

IV.3 Support for Wide Range of Interfaces

Going into CorporaCamp, we believed that we should support a wide range of interfaces. This belief was based on our experience with systems such as Mathematica, which allow users to pose their research questions in whatever form best fits them, and on our knowledge of how steep the learning curve for such systems can be. We also knew that as users became experts with the tools, they would want greater flexibility and control.

The distributed Utukku architecture allowed us to create a web-based JavaScript client as well as a command-line client. The command-line client allowed us to ask questions of the collections that the web-based client wasn't designed to support. For example, we were able to examine the frequency of texts matching a query and modify our queries as we explored the question, in a way that would be difficult to build into a graphical interface.

We confirmed that having the low-level interface can be useful and powerful as a companion to specialized tools that are built around specific research questions.

V. Application

V.1 Analysis

The implementation of the methods we had chosen to use to analyze and visualize the texts was relatively straightforward. After transforming a selection of texts from our source collections into a common JSON format, we were able to use MALLET, a machine learning toolkit developed at the University of Massachusetts Amherst, to learn a topic model from a subset of the texts and to use that model to annotate the entire selection. The first of these two steps must be done in advance, since it can take up to several hours for large training sets (on the order of 100 million words, for example). The annotation step, in which every document is labeled with topic assignments, could be done dynamically if necessary, but for the sake of efficiency and convenience we also performed it as a preprocessing step.

Our approach to visualization requires principal component analysis to be performed on the user's current selection of texts for each specific experiment. We had used the R environment for PCA in our initial demonstrations, but R is a complex tool and is not primarily designed to support user-friendly web applications. We decided instead to implement this part of the analysis in a library that could run on the Java Virtual Machine, which would allow it to be used from the Ruby version of Utukku via JRuby, as well as from a wide range of web application frameworks.

We were able to create an efficient implementation of PCA in Scala using the linear algebra packages of the Colt library, which was developed by CERN for high-performance scientific computing in Java. This approach is fast and memory-efficient enough to serve dozens of concurrent users from a single Amazon EC2 "micro" instance.
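The Scala/Colt implementation isn't reproduced here, but the core computation, projecting mean-centered vectors onto the top eigenvectors of their covariance matrix, can be sketched in miniature for two dimensions. This is a toy Python illustration, not the code used in the prototype:

```python
import math

# Minimal two-dimensional PCA: mean-center the points, form the 2x2
# covariance matrix, and find its principal eigenvalue and eigenvector
# in closed form.
def principal_component(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Covariance matrix entries (1/n normalization).
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    # Largest eigenvalue of [[cxx, cxy], [cxy, cyy]].
    trace, det = cxx + cyy, cxx * cyy - cxy * cxy
    lam = trace / 2 + math.sqrt(max(trace * trace / 4 - det, 0.0))
    # Corresponding eigenvector (handling the axis-aligned case).
    if abs(cxy) > 1e-12:
        vx, vy = lam - cyy, cxy
    else:
        vx, vy = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)

# For points lying near the line y = 2x, the principal axis should point
# roughly along (1, 2) normalized.
lam, (vx, vy) = principal_component([(0, 0), (1, 2), (2, 4), (3, 6.1)])
print(round(vy / vx, 2))
```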

In our early experiments with the application we discovered a few recurring problems with noise in our source texts creating uninteresting artifacts in the visualizations. We were able to resolve most of these issues by adding several additional preprocessing steps: for the texts from Hathi, for example, we removed running heads from the pages, and ignored all pages with fewer than 40 characters. Apart from these minor changes, the analysis components have required very few revisions since the workshop.
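The two cleanup steps can be sketched as follows. The 40-character threshold comes from the text above, while the way running heads are detected here (a first line repeated across many pages) is an assumption, not the method actually used:

```python
from collections import Counter

# A sketch of the Hathi preprocessing: drop very short pages and strip
# running heads. A "running head" is approximated as a first line that
# recurs on a large fraction of pages; this recurrence heuristic is
# assumed for illustration.
def clean_pages(pages, min_chars=40, head_fraction=0.5):
    first_lines = Counter(p.splitlines()[0] for p in pages if p.strip())

    def strip_head(page):
        lines = page.splitlines()
        if lines and first_lines[lines[0]] >= len(pages) * head_fraction:
            lines = lines[1:]
        return "\n".join(lines)

    cleaned = [strip_head(p) for p in pages]
    return [p for p in cleaned if len(p) >= min_chars]

pages = [
    "GULLIVER'S TRAVELS\nI stayed but two months with my wife and family.",
    "GULLIVER'S TRAVELS\nMy remaining stock I carried with me, part in money.",
    "147",  # a bare page number: far too short, so it is dropped
]
print(len(clean_pages(pages)))
```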

V.2 Visualization design

In our experiments with the application on the last day of the workshop, we found that while many features of the visualizations it produced were intuitive and interpretable, others were simply confusing. This problem was compounded by the fact that the drill-down functionality was unsatisfactory as a method for exploring the space represented in the map: examining individual pages was simply too time-consuming. We've attempted to make the map easier to interpret by adding a chart showing the "component loadings," which represent the contributions of topics to the two principal components currently shown in the map. We also added a graph showing the variance captured by each of the first eight principal components. This graph provides a visual explanation of how well the map captures the structure of the data. If the graph falls off quickly after the first two components, the dimensions shown in the map do a good job of summarizing the structure of the data. If it falls off more slowly, the map is a less faithful summary.
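This "falls off quickly" diagnostic can be made concrete: each component's share of the total variance is its eigenvalue divided by the sum of all eigenvalues. The eigenvalues below are made up for illustration:

```python
# Explained-variance ratios: each eigenvalue's share of the total
# variance. The eigenvalue spectrum here is invented for illustration.
def explained_variance(eigenvalues):
    total = sum(eigenvalues)
    return [v / total for v in eigenvalues]

# A spectrum that drops off sharply after two components: the first two
# capture most of the variance, so a two-dimensional map summarizes the
# data well.
ratios = explained_variance([5.0, 3.0, 0.8, 0.6, 0.4, 0.2])
print(round(ratios[0] + ratios[1], 2))
```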

These additional elements make the visualization more useful for trained users, but they also complicate an interface that we had initially intended to be very simple. Balancing these concerns — power and simplicity — will be one of the central tasks for any future work on the Woodchipper tool, as well as for any other visualization or analysis application developed or deployed as part of Corpora Space.

V.3 Visualization software

We experimented with two libraries for creating the visualizations in the web browser. The current prototype version of the application uses Flot, a simple plotting library for jQuery. Flot supports a relatively limited range of visualization approaches, but these have proven sufficient for a first implementation of the Woodchipper tool. We also experimented with Raphaël, a much more general JavaScript library for creating graphics on the web, which would probably serve as a more appropriate foundation for future visualization applications.

VI. Recommendations

VI.1 Methods for text exploration

While the primary goal of CorporaCamp was to serve as an experiment in rapid, iterative design and development — not to build an application that would become a central piece of Corpora Space — we did develop a stronger sense of the advantages and disadvantages of the methods we had chosen for scholarly text exploration.

LDA topic modeling provided a very powerful way of characterizing uncorrected, unstructured text, but interpreting the topics it infers often requires experience and training. A topic that shows up in our visualization with the label "church religion gospel holy authority" has a clear interpretation, for example, but the meaning of a topic with the label "voice heard man eyes see" isn’t as immediately obvious. This latter topic plays a central role in many of our experiments, however — it serves as a kind of general narrative topic that often allows the system to draw important distinctions between documents. In future applications using unsupervised topic modeling we would therefore suggest having a trained human annotator add labels instead of simply using the most prominent words from the distribution.

We also recently experimented with adapting the Woodchipper application to a very different kind of text collection: a corpus of syllabi harvested from university websites by Dan Cohen at the Center for History and New Media. Topic modeling also proved useful here, since it was not possible to extract any consistent structure from the 15,000 syllabi we selected. It wasn't as clear that principal component analysis was the best way to perform the dimensionality reduction, however — most experiments did not produce the interpretable clusters that we very commonly saw when looking at literary and historical texts. We are experimenting with adding components implementing other dimensionality-reduction techniques — for example self-organizing maps — that could be substituted for PCA in the current system. The comparative failure of our PCA-based visualization on this collection highlights the need to tailor the methods used for visualization and exploration to the shape of the dataset.

VI.2 Collections Interoperability

We brought together subsets of three collections at CorporaCamp and made one tool work with all of the texts. In the process, we did some pre-processing to give all of the texts the same metadata. Essentially, we made all of the texts fit the particular profile that our functionality required.

Different functionality may need different metadata or collection properties. Instead of trying to make all collections fit a single profile of metadata, structures, and markup, we may want to explore using profiles to balance the need for interoperable collections and collections built with a particular set of properties.

VI.3 Create a Path to an Expert-level Interface

Tools such as the Woodchipper are limited to exploring one particular aspect of the texts. People need tools that are simple to use, with a shallow learning curve. Based on the concept of flow, in which people enjoy tasks that closely match skill and challenge, we need tools that act as stepping stones to more powerful and flexible (and usually more complex) tools. As people exercise their skills against challenges, their skills improve, requiring slightly more complex challenges to restore flow.

VII. Conclusion

The trade-offs, challenges, and successes of CorporaCamp will inform the design process for Corpora Space. Our proposal for the next workshop will broaden out from a single piece of functionality in order to address the questions and problems involved in combining multiple tools working across multiple distributed collections. This work will feed into the road map for the next phase of Corpora Space.