Project Bamboo wiki: Corpora Space - BTP Proposal - 6 July 2010

This was originally posted to the Project Bamboo wiki, last modified 1 March 2011.

4 - Project Overview


4.2 Bamboo Corpora Space

By the end of the first 18 months, we will have prepared the design groundwork to launch the second project: applications that will allow scholars to work on dispersed digital corpora using a broad range of powerful research tools and services. We use the term “Bamboo Corpora Space” here to refer to a set of applications that have in common two fundamental characteristics. First, each application will provide powerful services designed to help scholars curate, interpret, depict, and discuss corpora. This is a key category across many disciplines in the humanities because corpora (textual and non-textual) are a central object of study. Second, each application will use an array of scholarly and other web services provided by the Bamboo services platform. This is a key step on the technology side, because it will permit the development of a generation of applications that re-use tools and software components from a common source.

Corpora Space applications could be realized in many ways. A Corpora Space application might be situated within Bamboo Work Spaces; specialized corpora management services can be embedded in a general collaborative and content management environment designed for the humanities. But that is not the only way in which Corpora Space applications could be realized. For example, technologists might draw upon the Bamboo services platform in order to build a Corpora Space application that would work as a “Pipes”1-based environment in which scholars can create their own mini-applications by mashing up data and tool services. A Corpora Space application might be realized as a lightweight iPad application, or it might take the form of refactoring part of an existing corpora management tool using Bamboo services. Finally, we might envision the development of a dedicated Corpora Space platform, which, like the CollectionSpace2 platform for the museum community, could be readily customized and extended to build and host a wide range of complex corpora applications. (We provide more information about these possible Corpora Space application options in section 5.1.)

We are not yet certain which of these options it would be best to take up in phase 2. Thus in phase 1, our goal for Corpora Space is to carry out a structured design process with the participation of the project partners, as well as other institutions, to determine which application or applications to develop in phase 2. In making this determination, this broad range of institutions will need to consider both what is technologically feasible and what is of greatest importance for humanities scholars. The key will be to enable scholars to carry out complex research tasks using a common set of tools and services on dispersed corpora. We expect to learn a great deal from the work carried out by a number of more focused corpora environments in the humanities, for example, in French and English literature, classical scholarship, and linguistics. Our design process, which is described more fully in section 5.4, will include the solicitation, discussion, and review of a series of Corpora Space white papers written by scholars and technologists.

1. See, for example, Yahoo Pipes.

2. See CollectionSpace.


5 - Technical Approach


5.1 Technology Ecosystem and Strategy


[Diagram: Bamboo technology ecosystem. * Corpora Space applications could be built in each of the asterisked areas.]


With the help of this ecosystem diagram, we can more fully describe the possibilities for phase 2 Corpora Space applications and their relationship to Work Spaces and the Bamboo shared infrastructure projects.

Again, by “Corpora Space” we mean a set of corpora-centered applications that use the services provided by the Bamboo platform to provide powerful functions to scholars as they curate, interpret, depict, and discuss dispersed corpora. Here are some examples of possible Corpora Space applications:

1. Corpora services within Bamboo Work Spaces. Work Spaces would be developed so that they offer increasingly powerful and specialized corpora management functions. These functions could be provided via the services on the Bamboo platform and built directly into the Work Spaces environments. The corpora services would be exposed via APIs to the Work Spaces.

2. "Pipes" model. Technologists could develop Pipes-based corpora applications on top of the Bamboo Services platform. These applications would be among the “scholarly applications” in the middle column of the diagram above.

3. Refactoring of an existing humanities corpora application. The refactored application could be built on top of the Bamboo platform or as a stand-alone application. In the former case, the application would be among the “scholarly applications” in the middle column above; in the latter, it would be in the “other humanities applications” column.

4. iPad application. With the help of the Bamboo platform’s web services, technologists could build an iPad application as a stand-alone application. This would be among the “other humanities applications.”

5. Corpora Space equivalent of the CollectionSpace Platform. It would be possible for technologists to build such an application platform on top of the Bamboo Services platform. In this case, the application would be among the “scholarly applications” in the middle column above.
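As a rough illustration of the "Pipes" model in option 2, the sketch below chains small services into one mini-application. It is only a sketch under stated assumptions: `pipe`, `fetch_texts`, `tokenize`, and `count_tokens` are hypothetical local stand-ins and do not correspond to any actual Bamboo service or API.

```python
from functools import reduce


def pipe(*stages):
    """Compose stages left to right into a single callable."""
    def run(data):
        return reduce(lambda value, stage: stage(value), stages, data)
    return run


# Hypothetical stand-ins for remote data and tool services.
def fetch_texts(source):
    # A real data service would retrieve texts from a dispersed collection.
    return list(source)


def tokenize(texts):
    # Lowercase each text and split it on whitespace.
    return [text.lower().split() for text in texts]


def count_tokens(token_lists):
    # Tally how often each token occurs across all texts.
    counts = {}
    for tokens in token_lists:
        for token in tokens:
            counts[token] = counts.get(token, 0) + 1
    return counts


# A scholar-assembled "mini-application": fetch, tokenize, count.
word_counts = pipe(fetch_texts, tokenize, count_tokens)
```

Because each stage only consumes the previous stage's output, a scholar could swap one tool service for another without touching the rest of the chain.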

We wish to emphasize two points here. The first is that there is a range of technology options open to us for development of scholarly applications after phase 1, and that these applications can be developed both by the Bamboo partners and, in a number of cases, by other institutions as well. The second is that the services provided by the Bamboo services platform, along with the interoperability projects, are essential to each option, and will need to be enhanced and sustained to support future applications.

5.2 Definitions and Scope of "Content" and "Corpora"

For the purposes of this project we mean the following when we speak of “content” and “corpora.” By “content” we mean text, images, video, audio, and associated metadata. By “corpora” we mean structured sets of these materials. Structured texts may themselves include associated images, video, and/or audio. The corpora may be stored in one location or made up of the aggregation of distributed materials across digital collections held in libraries, research centers, museums, and/or archives within and outside of universities.
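These definitions can be pictured as a small data model. The following is a minimal sketch only; the class and field names (`ContentItem`, `Corpus`, `location`, `associated`) are illustrative assumptions and are not part of any Bamboo specification.

```python
from dataclasses import dataclass, field
from enum import Enum


class ContentType(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"


@dataclass
class ContentItem:
    """One piece of content together with its associated metadata."""
    identifier: str
    content_type: ContentType
    location: str  # hypothetical: identifier of the holding collection
    metadata: dict = field(default_factory=dict)
    # A structured text may itself include associated images, video, or audio.
    associated: list = field(default_factory=list)


@dataclass
class Corpus:
    """A structured set of content items, stored together or aggregated."""
    name: str
    items: list = field(default_factory=list)

    def is_distributed(self) -> bool:
        # True when the items are held in more than one location.
        return len({item.location for item in self.items}) > 1
```

The `is_distributed` check mirrors the distinction drawn above between a corpus stored in one location and one aggregated from several collections.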

In phase 1, Work Spaces will be able to ingest and store all content types as digital binaries. This capability – equivalent to storing a file on a hard disk drive, without regard to whether or how that file can be manipulated, analyzed, or transformed – is an essential precursor to extended functionality that is useful to a scholar. Examples of extended capabilities include the ability to transform stored content from one format to another (e.g., Word documents to PDF, TIFF images to JPEG, or unstructured text to an indexed object-relational structure à la PhiloLogic); to generate concordances of textual materials; to generate histograms that represent tonal distribution in a digital image; or to collate multiple drafts or editions of a digitized text. In phase 1, Work Spaces will enable annotation, transformation, discussion, and sharing of documents whose principal content is text. Please see section 5.3 immediately below for a more detailed list of these functions. Bamboo will not develop extended functionality for all content types in phase 1.
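To make one of these extended capabilities concrete, here is a minimal sketch of a keyword-in-context concordance. It assumes plain-text input and a simple word tokenizer; the function name `concordance` and its parameters are hypothetical, and this is not how PhiloLogic or any Bamboo service implements the feature.

```python
import re


def concordance(text, keyword, width=3):
    """Return keyword-in-context lines with `width` words on either side."""
    words = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, word in enumerate(words):
        if word == target:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            # Mark the keyword with brackets; strip padding at text edges.
            lines.append(f"{left} [{word}] {right}".strip())
    return lines
```

Calling `concordance(text, "corpus", width=5)` would return one line per occurrence of "corpus", each showing up to five words of context on either side.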

In phase 1 Bamboo will define the functions for Corpora Space applications to be implemented in phase 2. We expect to focus initially on the needs of scholars who work with text-centered corpora. It is possible — and may be highly desirable — to address some critical and common needs for the curation, analysis, visualization, and presentation of other media types, such as audio and video. We note, though, that handling of content as varied as music recordings and video clips presents complex and challenging issues. Thus we will not make strong claims about functions that can be applied to an extended range of content types prior to performing the phase 1 review and design work for Corpora Space.

The scholarly services to be developed in phase 1 will operate on digitized texts. As we evolve our work in phase 1, we intend to consider services that are of value to the analysis of other digital content types. Certainly a number of our partners bring both expertise and interest in focusing on services that operate on audio and video media. The Corpora Space design process and the capacities and interests of the partners in this project will help to define which additional scholarly web services we explore in phase 1.


5.4 Corpora Space: Phase 1 design

In Phase 1, Corpora Space design work will explore the needs, possibilities, and challenges involved in implementing powerful web-based applications for research across multiple and dispersed corpora. This work will require close coordination among all areas of work in this project. The primary deliverable for the Corpora Space design phase will be detailed roadmaps for building approximately two Corpora Space applications.

The Corpora Space design effort will focus first on people, next on exploration, and last on decision-making in preparation for the Corpora Space implementation phase.

The Corpora Space design effort will begin by recruiting a larger team. The current planning-phase Corpora Space participants (University of Maryland as project lead, Tufts University, University of Oxford, Northwestern University, University of Wisconsin at Madison, University of Chicago, and UC Berkeley) will identify a wide group of individuals and institutions in the humanities who are interested in and able to consult on this effort, thus expanding the design team. The team will also consult with consortia (e.g. CLARIN, DARIAH and CHAIN) that can provide insight, guidance and experienced counsel for Corpora Space design.

Once the full design team is assembled, its first order of business will be to leverage what already exists. The team will identify potential corpora, collections, projects and tools for consideration as part of Corpora Space design. The team will develop an appropriate Corpora technology evaluation framework to help establish a common language for assessing existing and future applications. Design activity will include assessment of current corpora-based applications in the humanities and their future development and support plans; exploration and discussion with faculty across disciplines about the use of shared corpora applications; lessons learned from other projects that are attempting to build a common platform for multiple application instances across disciplines and data models (e.g. CollectionSpace in the museum domain); identification and evaluation of corpora that may be ready for Corpora Space (e.g. the University of Virginia and University of Michigan's digital library collections); and technical consultation with other related services environments (e.g., those of CLARIN and SEASR). Examples of ongoing scholarship and scholarly technologies we will consult with include Nines, Perseus, ARTFL, TAPoR, and Oxford’s JISC-supported VRE-SDM. We will also consider more recent projects, such as Berkeley Prosopography Services and its relationship to the Cuneiform Digital Library.

Once the overall parameters and the most promising opportunities have been identified, the design team will recruit 3-6 groups to develop white papers about possible Corpora Space applications. We will request that these white papers take into account opportunities and lessons learned from Bamboo’s work in Work Spaces, Bamboo scholarly services and services platform, and collections interoperability. The Corpora design team will review the white papers and, based on technical fit, corpora readiness, broader benefit to the humanities community, and other criteria, recommend one or two applications for implementation during Phase 2. The Bamboo Project Executive Group and Steering Council will then make the final determination regarding which candidates to implement. During this process we will actively share results with the Mellon Foundation and solicit guidance on possible areas of interest.

To complete the design phase, thorough technical and project development roadmaps will be created for the chosen projects. Roadmaps will identify deliverables, technology, corpora, consortial partnerships, resource requirements and a development timeframe. Roadmaps will also include early design elements such as wireframes and data models.