Conference Agenda

Session: Organizing Knowledge
Time: Wednesday, 19 June 2024, 2:00pm - 3:30pm

Session Chair: Agiatis Benardou, DARIAH-EU
Location: Auditorium B3, 5th floor, Avenida de Berna 26C, Berna Campus, NOVA FCSH, Lisbon, Portugal

Presentations

A Digital Workflow for Transforming Unstructured Text from Humanities Publications into a Scholarly Knowledge Graph

Vayianos Pertsas, Sofia Nasopoulou, Marianna Tsigkou

Athens University of Economics and Business, Greece

The significant surge in research publications across disciplines poses a growing challenge for experts in maintaining a comprehensive understanding of their respective fields [1]. Especially in multidisciplinary fields like Digital Humanities, it becomes notably harder to retrieve and interconnect knowledge, particularly in the case of unstructured text from OCRed papers. This situation calls for new “strategic-reading” methods that transform the essence of knowledge encoded in textual form into structured formats like Knowledge Graphs (KG), thus changing the way researchers engage with literature [2]. In this paper we present a digital workflow for creating such a KG from Humanities publications through the identification, extraction and interrelation of entities derived from the Scholarly Ontology (SO) [3], which was specifically designed for documenting scholarly work; NeMO [4] is a specialization of SO for DH.

We focus on four types of entities: 1) Activity, noun phrases denoting research processes such as archeological excavations, surveys, experiments, etc.; 2) Finding, denoting the results of the above activities; 3) Reference, spans representing the individual entries within the article's bibliography; 4) CitationPointer, textual elements within the main text that refer to an entry in the article's reference list. Figures 1, 2 and 3 show indicative examples.
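To make the target schema concrete, a minimal Python sketch follows; the class, field names and example instances are our own illustration and are not taken from the paper or its figures.

from dataclasses import dataclass

@dataclass
class Entity:
    """One extracted textual span and its entity type."""
    etype: str        # "Activity", "Finding", "Reference" or "CitationPointer"
    text: str         # the surface span as it appears in the article
    sentence_id: int  # index of the source sentence within the article

# Invented examples, for illustration only:
examples = [
    Entity("Activity", "excavation of the northern trench", 3),
    Entity("Finding", "a sequence of Iron Age floor deposits", 3),
    Entity("CitationPointer", "(Smith 2004)", 3),
]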

Initially, the text is segmented into sentences using spaCy [1]. For the entity extraction task we employ four RoBERTa-base Transformers from the Hugging Face library, which we fine-tune using two manually annotated datasets: one with sentences from articles’ main text, containing annotations of Activities, Findings and CitationPointers, and another with entire pages, containing annotations of References. Both were randomly sampled from 25,681 OCRed articles from Humanities disciplines such as Archeology, Paleontology, Anthropology, etc., published between 2000 and 2021 and retrieved from the JSTOR repository. The datasets’ characteristics are presented in Tables 1 and 2.
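A minimal sketch of such a pipeline using spaCy and the Hugging Face transformers library is given below; the checkpoint path "path/to/finetuned-activity-model" and the input file name are placeholders, since the fine-tuned models themselves are not published with this abstract.

import spacy
from transformers import pipeline

# Step 1: segment the article text into sentences with spaCy.
nlp = spacy.load("en_core_web_sm")
article_text = open("article.txt").read()  # placeholder input file
sentences = [s.text for s in nlp(article_text).sents]

# Step 2: run a fine-tuned RoBERTa token-classification model per sentence.
ner = pipeline(
    "token-classification",
    model="path/to/finetuned-activity-model",  # placeholder checkpoint
    aggregation_strategy="simple",  # merge word pieces into full spans
)

for sent in sentences:
    for span in ner(sent):
        print(span["entity_group"], span["word"], round(span["score"], 3))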

Evaluation results, measured in Precision, Recall and F1, are presented in Table 3. Error analysis showed that performance depends on the complexity of the textual spans: Activities exhibit higher lexico-syntactic variation than Findings, while References and CitationPointers follow very specific and easily recognizable patterns.

The extracted entities are then interrelated via resultsIn(Activity, Finding) and indicates(CitationPointer, Reference) relations, using inference rules based on the collocation of the entities in the same sentence and on regular expressions, respectively. All interrelated entities are associated with their source sentence and the publication’s metadata, and are further transformed into URIs, yielding the KG as RDF triples. The entire workflow is shown in Figure 4.
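As a rough illustration of the interrelation and RDF serialization steps, consider the following rdflib sketch; the namespace URIs, property names and URI-minting scheme are assumptions of ours, not the actual Scholarly Ontology terms.

import re
from rdflib import Graph, Namespace

# Hypothetical namespaces; the real SO vocabulary differs.
SO = Namespace("https://example.org/scholarly-ontology/")
KG = Namespace("https://example.org/kg/")

g = Graph()
g.bind("so", SO)

def mint(kind, sent_id, text):
    """Turn an extracted span into a URI (illustrative scheme only)."""
    return KG[f"{kind}/{sent_id}/{text.replace(' ', '_')}"]

def link_collocated(sent_id, activities, findings):
    """Collocation rule: an Activity and a Finding extracted from the
    same sentence are related via resultsIn."""
    for a in activities:
        for f in findings:
            g.add((mint("activity", sent_id, a), SO.resultsIn,
                   mint("finding", sent_id, f)))

def link_pointer(sent_id, pointer, n_references):
    """Regex rule: a numeric pointer such as "[12]" indicates the
    corresponding entry of the reference list."""
    m = re.match(r"\[(\d+)\]", pointer)
    if m and int(m.group(1)) <= n_references:
        g.add((mint("pointer", sent_id, pointer), SO.indicates,
               KG[f"reference/{m.group(1)}"]))

link_collocated(3, ["excavation of the northern trench"],
                   ["a sequence of Iron Age floor deposits"])
link_pointer(3, "[12]", 40)
print(g.serialize(format="turtle"))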

Constructing KGs from academic publications is an area of active research. Works like [5,6,7,8] focus on integrating bibliographic content, whereas research in [9,10,11] applies off-the-shelf NER models to extract entities, such as materials and tasks, from scientific papers. While these efforts deal with already structured articles and standard named entities, our workflow enables the extraction of semantically complex concepts of variable length, like research activities and findings, and deals with unstructured (OCRed) text, while focusing on the domain of Humanities research. The produced KG allows for answering complex queries such as: “find all the references of the articles containing activities that deal with 3D Reconstruction” or “retrieve the findings of the articles that have a specific DOI as a reference”.
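Continuing the sketch above, the first example query might be expressed in SPARQL roughly as follows; the property names (so:label, so:partOf, so:hasReference) are again illustrative placeholders rather than the ontology's real vocabulary.

# "Find all the references of the articles containing activities
#  that deal with 3D Reconstruction" (placeholder vocabulary).
QUERY = """
PREFIX so: <https://example.org/scholarly-ontology/>
SELECT DISTINCT ?reference WHERE {
    ?activity a so:Activity ;
              so:label ?label ;
              so:partOf ?article .
    ?article so:hasReference ?reference .
    FILTER(CONTAINS(LCASE(STR(?label)), "3d reconstruction"))
}
"""
for row in g.query(QUERY):  # g is the rdflib Graph built above
    print(row.reference)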

[1] https://spacy.io/



Unveiling the Veil: reflecting on the omission of workflows in Digital Humanities research projects and their implications for reproducibility

João Oliveira1, Constança Almeida1, Filipa Magalhães2

1NOVA University of Lisbon, School of Social Sciences and Humanities; 2Centre for the Study of the Sociology and Aesthetics of Music (CESEM), NOVA University of Lisbon, School of Social Sciences and Humanities

To what extent are workflows documented, accessible, and comprehensible? If achieving reproducibility is a structural objective, why are the definition, clarity, and explanatory fluidity of Digital Humanities project workflows not readily available, or occasionally inaccessible, to the interested scientific community? How significantly will this issue impact the replicability of research project results? When analysing these issues and comparing them with research projects in the so-called ‘exact sciences’, it appears that the majority of research projects in Digital Humanities omit their workflows. This research delves into whether the current state of Digital Humanities projects conforms to the global objectives of Open Science, Open Data and the FAIR principles, and examines whether there is any misalignment compromising the reproducibility and replicability of results.

Weiland (2018) defines workflows as a deliberate or rational organization of any purposeful activity, typically specifying its steps in the form of a process directed at a particular result. Therefore, to address our concerns about the omission of workflows in Digital Humanities research projects, we sought to discuss this definition through an in-depth qualitative analysis of some case studies.

To achieve this, we looked at three interdisciplinary projects ranging from contemporary performing arts to digital environmental history: “ERC Project: From Stage to Data, the Digital Turn of Contemporary Performing Arts Historiography (STAGE)”; “In Search of the Drowned in the Words: Testimonies and Testimonial Fragments of the Holocaust” and “The Making of a Forest: landscape change at the Argentine-Brazilian border, 1953-2017”.

Preliminary findings suggest that while digital humanities projects often articulate overarching principles of Open Science, the FAIR principles, and their general methodology, the granularity of workflow documentation required to enhance reproducibility is often lacking. Decisions made at critical junctures of the research process, nuances in information manipulation, and the specific configurations of the digital tools employed need to be made explicit (Liu et al., 2017). This raises concerns regarding the reproducibility and replicability of the knowledge generated within the Digital Humanities domain as a direct consequence of the lack of workflow evidence.

We consider it necessary to examine the disparity between the advancement of Open Science and similar principles and the underdeveloped state of workflows in Digital Humanities research projects. With clearer workflows in place, the replicability of results would become evident, and results could be more readily integrated into various ongoing projects. The clarification of individual research decisions, information handling, and sources explored throughout the project lifecycle lacks a comprehensive definition, posing challenges for e-science scholars seeking to develop methods based on those previously used in research projects (Antonijevic & Cahoy, 2018).

In this paper, we propose to reflect on the importance of making information about workflows accessible and comprehensible through documentation, giving concrete examples from the case studies analysed.



Guarding accessibility: AI-supported ontology engineering in the context of the MuseIT repository design

Vyacheslav Tykhonov1, Nasrine Olson2, Moa Johansson3, Kim Ferguson1, Nitisha Jain4, Lloyd May5, Andrea Scharnhorst1

1Data Archiving and Networked Services, Royal Netherlands Academy of Arts and Sciences, the Netherlands; 2University of Borås, Sweden; 3ShareMusic, Sweden; 4King's College London; 5Dartmouth College, Stanford University

The MuseIT project explores new technologies that facilitate and widen access to cultural assets and enable their co-creation. It builds upon the UN Universal Declaration of Human Rights and the Convention on the Rights of Persons with Disabilities, both of which proclaim human rights for all people, including people with disabilities. One of the key outcomes of MuseIT is a content repository capable of indexing, storing and retrieving multisensory, multi-layered digital representations of cultural assets while respecting the FAIR principles and long-term preservation. Dataverse is the envisioned platform. This paper zooms in on one important step: the creation of specific knowledge organisation systems suitable for the later archiving process. Based on ontology engineering, we look for new ways to describe accessibility facets of artefacts in the MuseIT repository. We use the term accessibility here in the context of disability, following the Wikipedia definition that “accessibility is the design of products, devices, services, vehicles, or environments so as to be usable by people with disabilities”.

When it comes to categorising disability, we start from the International Classification of Functioning, Disability and Health (ICF) released by the World Health Organisation. As its name suggests, this classification provides a medical view on disability. Additionally, we use Wikipedia.EN and a collection of web resources (the MuseIT disability collection) stored as part of the Now.Museum (Vion-Dury et al. 2023) on a Dataverse platform; in short, material which represents a more cultural view on the phenomenon (e.g., Salah et al. 2012).

In this paper, we present a new knowledge engineering solution (EverythingData) through which ‘proto-ontologies’ are created by combining Large Language Models with structured data in the form of Knowledge Graphs (Pan et al. 2024). It is schematically depicted in the figure below and, in essence, relies on a set of APIs. These enable the following workflow: (a) start with a term (or group of terms), for instance ‘disability’ from the ICF; (b) identify a related Wikipedia (EN) page (or other texts); (c) feed these texts into an LLM instance and receive a group of terms and their relations, which in essence represent a proto-ontology; (d) iterate this process; and (e) have a group of experts evaluate the variants of the proto-ontology.
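A minimal Python sketch of steps (a) to (d) could look as follows; the llm() stub, the prompt wording and the iteration cap are our own assumptions (EverythingData's actual APIs are not specified here), and the Wikipedia REST endpoint is used only as one possible text source.

import requests

def wikipedia_summary(term):
    # Step (b): fetch the English Wikipedia summary for a term.
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{term}"
    return requests.get(url, timeout=10).json().get("extract", "")

def llm(prompt):
    # Placeholder: call any LLM endpoint here and parse its answer into
    # (subject, relation, object) triples. Returns [] as a stub.
    return []

def proto_ontology_triples(text):
    # Step (c): ask the LLM for concepts and relations as triples.
    prompt = ("List the key concepts in the text below and the relations "
              "between them as (subject, relation, object) triples:\n\n" + text)
    return llm(prompt)

# Steps (a) and (d): start from an ICF term and iterate over new concepts.
seen, frontier, proto_ontology = set(), ["disability"], []
while frontier and len(seen) < 25:  # cap the iteration for this sketch
    term = frontier.pop()
    if term in seen:
        continue
    seen.add(term)
    triples = proto_ontology_triples(wikipedia_summary(term))
    proto_ontology.extend(triples)
    frontier.extend(obj for _, _, obj in triples)
# Step (e): experts then evaluate the resulting proto-ontology variants.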

Deeply concerned with issues of equality, democratisation and social inclusion, MuseIT researchers are aware of the power of knowledge organisation when it comes to documenting content and organising access. Knowledge organisation systems (KOS) in all their forms (classifications, thesauri, ontologies, etc.) are a double-edged sword: they make things visible, but they also represent a certain normative stance (Bowker & Star 2000). Together with experts, we will organise the human evaluation of these proto-ontologies to select the sets of terms and relations that best support a rich description of cultural assets with respect to their use by people with disabilities, while also giving credit to the variety of producers of cultural assets and acknowledging the specific conditions under which such creation took place.