Working around Walled Gardens: The Princeton Prosody Archive as Workflow
Meredith Martin, Rebecca Sutton Koeser, Mary Katherine Naydan
Princeton University, United States of America
The Princeton Prosody Archive (PPA) is an open-source, full-text searchable database of 6,000+ English-language digitized works about the study of poetry, versification and pronunciation. But the PPA is also — technically and conceptually — a workflow that brings together materials from both HathiTrust Digital Library and Gale/Cengage’s Eighteenth Century Collections Online into one searchable interface (figure 1). This workflow is necessary because today’s digital research landscape consists of silos or “walled gardens” of scholarly materials controlled by proprietary vendors and digital libraries with layers of copyright restrictions. Designed to keep scholars inside them, these “walled gardens” limit scholars’ ability to work across collections. What if scholars could analyze materials from multiple collections at once? What discoveries could scholars make if they weren’t limited by a particular vendor’s metadata, OCR quality, indexing practices, or built-in Natural Language Processing tools? [1]
The PPA has asked and answered these questions over a sixteen-year process of negotiating with vendors, collecting and curating materials, building the public-facing web application, and making the data usable for computational analysis. Our short paper will discuss how we gained access to additional levels of data by securing Memoranda of Understanding from HathiTrust and Gale/Cengage and worked within their data infrastructures to pull in bibliographic metadata, full-text OCR, and page images from their APIs and servers. Although our work improved their infrastructures and data, our workflows did not always align: for instance, we waited over a year in Gale’s development queue for minor yet necessary enhancements to their API to access basic metadata. Similarly, we spent years correcting HathiTrust’s inaccurate metadata for hundreds of works but could not feed those corrections back into HathiTrust because of HathiTrust’s understandably limited ability to develop individual workflows with each partner research library.
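To give a concrete sense of what pulling bibliographic metadata from one of these providers can look like, the sketch below queries HathiTrust's public Bibliographic API for a single volume. The endpoint shape, response fields and placeholder volume ID are assumptions for illustration only, and the sketch is far simpler than the PPA's actual ingest workflow.

    import requests

    # Minimal sketch: fetch a brief bibliographic record for one HathiTrust volume.
    # The volume ID used below is a placeholder, not an actual PPA item.
    HT_BIB_API = "https://catalog.hathitrust.org/api/volumes/brief/htid/{htid}.json"

    def fetch_brief_record(htid: str) -> dict:
        """Return the brief bibliographic record for a single HathiTrust volume."""
        response = requests.get(HT_BIB_API.format(htid=htid), timeout=30)
        response.raise_for_status()
        return response.json()

    record = fetch_brief_record("mdp.39015012345678")  # placeholder volume ID
    for rec in record.get("records", {}).values():
        print(rec.get("titles"), rec.get("publishDates"))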
The roadblocks we encountered while trying to integrate our workflows starkly illustrate how Big Tech and for-profit companies circumscribe how scholars conduct academic research, even on public-domain materials. For instance, despite HathiTrust being a not-for-profit collaborative of academic and research libraries that own the physical copies of the digitized works, we needed special permission from Google to display page image thumbnails because Google digitized 95% of HathiTrust works in the mid-2000s [2]. We had to prove that our archive of highly specialized texts about prosody — minuscule compared to HathiTrust’s 18+ million-volume collection (see figure 2) — wouldn’t compete with Google Books! This example is one of many that expose the fragility of the current research ecosystem, in which scholars encounter often invisible limitations around who can access these materials, how they can be used, and even which materials get included in the first place [3].
The PPA imagines an ideal scholarly information landscape in which the curation of data is driven by research questions, rather than by profit or by who happens to have digitized what. Rather than developing new resources from scratch, we imagine sharing workflows, ideas, and code with other scholars to remove barriers and work across — rather than just within — our research library’s subscriptions.
Consolidating the heterogeneous landscape of literary corpora
Ingo Börner2, Vera Maria Charvat1, Matej Ďurčo1, Adrián Ghajari3, Lukas Plank1, Salvador Ros3
1Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH), Austrian Academy of Sciences (OEAW), Austria; 2University of Potsdam, Germany; 3Spanish National Distance Education University (UNED), Spain
The Computational Literary Studies Infrastructure (CLS Infra) project aims to consolidate the heterogeneous landscape of data and tools in the CLS domain, with a specific focus on literary corpora. This contribution describes a data integration task against the background of developing a bespoke data model.
As part of the project (D6.1, Ďurčo et al., 2022), a metadata inventory of literary corpora was compiled to provide an overview of existing datasets, including their formats and mode of access. Based on the information derived from this initial collection of metadata and the Metamodel for Corpus Metadata (MKM; Odebrecht, 2018) as a conceptual starting point, the CLSCor ontology has been developed within the project as a unifying conceptual model to describe various aspects of literary texts. It introduces the basic entities: Corpus, Corpus Document and Feature, Feature being a generic mechanism to capture any specific characteristics of manifestations of literary works or collections thereof. These can be structural or semantic phenomena, like paragraphs, distinct speakers in a drama or verses in a poem. To guarantee its interoperability, the CLSCor ontology is based on CIDOC CRM and its extensions CRMdig and LRMoo. The CLSCor ontology will be accompanied by a set of controlled vocabularies (modeled in SKOS) that are currently in development.
In order to validate the applicability of the proposed ontology, data from three distinct datasets covering the main literary genres were processed: novels (ELTeC), drama (DraCor), and poetry (POSTDATA). Not only do these sample datasets represent the different genres, but they also rely on three different underlying data structures and solutions for providing the data, making them ideal as a proof of concept. The general workflow comprised mapping the respective source models to the CLSCor data model, transforming the data into RDF using custom scripts and merging the converted datasets into one consolidated knowledge graph. Given the distinct ways these three datasets were offered, a custom solution was implemented for each:
ELTeC data, exposed as TEI files in a GitHub repository, was obtained by accessing the GitHub API; relevant information was extracted using XPath expressions and transformed into CLSCor-compliant RDF via a Python script (a simplified sketch of this conversion follows the list).
Data from DraCor – TEI-encoded corpora of plays – was integrated by transforming the JSON data returned by the DraCor API into RDF with a custom Python script.
In the case of POSTDATA, the legacy tool "Horace", originally used in the project's data generation workflow, has been adapted to generate RDF data conforming to the CLSCor ontology.
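To illustrate the ELTeC branch of this workflow, the sketch below lists TEI files in an ELTeC repository via the GitHub API, extracts titles with XPath and emits RDF with rdflib. The repository path, the namespace IRI and the CLSCor class and property names are illustrative assumptions, not the project's actual conversion script or the published ontology terms.

    import requests
    from lxml import etree
    from rdflib import RDF, Graph, Literal, Namespace, URIRef

    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
    # Illustrative namespace; the published CLSCor IRIs and terms may differ.
    CLSCOR = Namespace("https://example.org/clscor/")

    def eltec_file_urls(repo="COST-ELTeC/ELTeC-eng", path="level1"):
        """List raw download URLs for TEI files in one ELTeC repository folder."""
        api = f"https://api.github.com/repos/{repo}/contents/{path}"
        items = requests.get(api, timeout=30).json()
        return [i["download_url"] for i in items if i["name"].endswith(".xml")]

    def add_document(url: str, graph: Graph) -> None:
        """Extract the title via XPath and add CLSCor-style triples for one document."""
        tei = etree.fromstring(requests.get(url, timeout=30).content)
        titles = tei.xpath("//tei:titleStmt/tei:title/text()", namespaces=TEI_NS)
        doc = URIRef(url)
        graph.add((doc, RDF.type, CLSCOR.CorpusDocument))       # illustrative class
        if titles:
            graph.add((doc, CLSCOR.title, Literal(titles[0])))  # illustrative property

    graph = Graph()
    for url in eltec_file_urls()[:5]:   # sample a handful of documents
        add_document(url, graph)
    print(graph.serialize(format="turtle"))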
Having reached the milestone of merging the three distinct sources into one graph, we are now evaluating the graph for coherence, consistency and usability for exploration via a discovery application. The outcomes of this evaluation will inform further iterations and the integration of further sources.
In addition to a rich catalogue of literary corpora, another envisioned outcome is a set of robust conversion workflows for a variety of data sources and formats, accompanied by tutorials and promoted in training events to foster their reuse.
From NEWW to SHEWROTE: Developing a workflow to retroactively document a research dataset lifecycle and to ensure future data sustainability
Alicia Montoya1, Amelia Sanz2
1Radboud University, The Netherlands; 2Complutense University of Madrid, Spain
Between 1997 and 2023, a network of 100+ scholars across Europe worked collaboratively, in a series of funded and unfunded projects, to produce a corpus of highly granular data on the reception of women authors from the Middle Ages to circa 1940. Embracing almost 7,000 women writers and 30,000 individual receptions of their works or persons, this corpus was successively stored in different databases: a first one developed at Utrecht University (1997-2014), and a second one, NEWW, built and hosted by the KNAW-Huygens Institute until 2023. At present, our WWIH DARIAH WG is developing a completely new, third iteration of the database, SHEWROTE, in partnership with the Humanities Lab of Radboud University (The Netherlands).
The redevelopment of the database has entailed a complex process of a) documenting the history of the database’s previous iterations, particularly (implicit) decisions made in data harvesting and enrichment, b) rethinking the database’s basic concepts and research questions, and formulating requirements in line with the research focuses of a new generation of scholars, and c) designing a new datamodel that is responsive to current needs, and adaptable to future needs, especially in data interconnectivity and sustainability. This conceptual rethinking preceded the work, which has now begun, of retooling and crosswalking data from the old (now archived) version of the database to the new one.
This paper presents some of the challenges faced so far in redeveloping the database. These have included, first, the move from a non-relational to a relational database structure, and the design of a new datamodel that introduces several additional fields. Thus, the addition of the FRBR fields of Work and Manifestation will allow us to link data to existing identifiers, thereby preparing the data for future linking with databases such as Cambridge’s Orlando database. By introducing more granular place data, we anticipate mapping functionality and visualizations that reflect new research focuses on the physical movement of texts and their authors in colonial and other settings. Further, by including alternative names, alternative genders (for pseudonyms) and the dates during which alternative names (pseudonyms, married names) were used, we are making the existing Person data more granular, linking it to VIAF identifiers and creating new CERL identifiers when necessary, thus contributing to other collaborative datasets and to the broader (bibliographic) research community. Throughout this process, we strive to make decisions taken in the past explicit, documenting these systematically, and thereby retroactively making the tacit knowledge of generations of scholars, working in very different technical and institutional settings, compliant with modern FAIR data principles. Finally, we discuss the key issue of credits according to the COARA Guiding Principles: how to document the contribution of the dozens of scholars, students and developers who contributed their labor and expertise to the database, as record creators and editors, and in other capacities, and how to acknowledge different levels and kinds of involvement in a manner that does justice to the complexities of digital, collaborative workflows and the research data lifecycle.
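As a purely illustrative sketch of the entities described above, the following Python dataclasses outline how Person records with alternative names and identifiers might relate to FRBR Work and Manifestation fields and to individual receptions; all class and field names are assumptions for exposition, not the actual SHEWROTE schema.

    from dataclasses import dataclass, field
    from typing import Optional

    # Illustrative sketch only; the actual SHEWROTE relational schema will differ.

    @dataclass
    class AlternativeName:
        name: str                         # pseudonym or married name
        gender: Optional[str] = None      # gender associated with a pseudonym
        used_from: Optional[int] = None   # first year the name was in use
        used_until: Optional[int] = None  # last year the name was in use

    @dataclass
    class Person:
        name: str
        viaf_id: Optional[str] = None     # VIAF identifier, where one exists
        cerl_id: Optional[str] = None     # CERL identifier, created when necessary
        alternative_names: list = field(default_factory=list)

    @dataclass
    class Work:                           # FRBR Work: the abstract creation
        title: str
        author: Person

    @dataclass
    class Manifestation:                  # FRBR Manifestation: a concrete publication
        work: Work
        place: Optional[str] = None       # granular place data enables mapping
        year: Optional[int] = None

    @dataclass
    class Reception:                      # a documented reception of a work or person
        source: Manifestation
        received: Person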