Conference Agenda

Overview and details of the sessions of this conference.

Session Overview
Session
DE4DS 3: Workshop on Data Engineering for Data Science 3
Time:
Tuesday, 04/Mar/2025:
1:30pm - 3:00pm

Session Chair: Tanja Auge, University of Regensburg
Location: WE5/00.022

Lecture Hall 1

Presentations
1:30pm - 1:50pm

Towards Complex Table Question Answering Over Tabular Data Lakes

Daniela Risis1,2, Jan-Micha Bodensohn1,2, Matthias Urban2, Carsten Binnig2,1

1DFKI; 2Technische Universität Darmstadt

Natural Language Interfaces for Databases (NLIDBs) are an interesting alternative to SQL since they empower non-experts to query data. However, NLIDB approaches require this data to first be integrated into a database schema, which causes high upfront data engineering and integration overheads. As such, Open Table Question Answering (OTQA) is promising, since it allows directly querying the data in data lakes without needing to incorporate it into a schema. Many recent OTQA approaches combine Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs), where relevant tables are first retrieved from a data lake and then used as input to an LLM to answer the user query. In this paper, we take the first systematic step towards investigating how LLMs paired with table retrievers can answer queries over private tabular data lakes. As a main finding, we see that even when tuning several parameters of this approach, current LLMs still fail to answer queries that focus on the simple extraction of individual cell values, let alone aggregate queries. Thus, they are far from the rich querying capabilities that NLIDB approaches offer today. To address this, we outline promising directions for future work to enable complex question answering over tabular data lakes.
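The retrieve-then-answer pipeline described in the abstract can be pictured with a minimal sketch. The table corpus, the word-overlap retriever, and the prompt format below are placeholder assumptions for illustration, not the authors' implementation:

```python
# Minimal sketch of an Open Table QA pipeline: retrieve candidate tables
# from a "data lake", then build an LLM prompt from the top match.
# The word-overlap retriever stands in for a learned table retriever.

def tokenize(text: str) -> set[str]:
    return set(text.lower().replace("?", "").split())

def retrieve_tables(query: str, tables: dict[str, list[str]], k: int = 1) -> list[str]:
    """Rank tables by word overlap between the query and column headers."""
    q = tokenize(query)
    scored = sorted(
        tables,
        key=lambda name: len(q & tokenize(" ".join(tables[name]))),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, table_name: str, columns: list[str]) -> str:
    """Serialize the retrieved table schema into the LLM input."""
    return (f"Table {table_name} with columns {', '.join(columns)}.\n"
            f"Answer the question: {query}")

lake = {
    "employees": ["name", "salary", "department"],
    "orders": ["order_id", "customer", "total"],
}
top = retrieve_tables("What is the salary of Alice?", lake)[0]
prompt = build_prompt("What is the salary of Alice?", top, lake[top])
```

In a real system the retrieved table rows, not only the schema, would be serialized into the prompt, and the prompt would be sent to an LLM; the paper's finding is precisely that this last step often fails even for single-cell lookups.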

Risis-Towards Complex Table Question Answering Over Tabular Data Lakes-231_a.pdf


1:50pm - 2:15pm

The Kosmosis Use Case of Crypto Rug Pull Prevention by an Incrementally Constructed Knowledge Graph

Philipp Stangl1, Christoph Peter Neumann2

1Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; 2Ostbayerische Technische Hochschule Amberg-Weiden, Amberg, Germany

Current methods to prevent crypto asset fraud are based on the analysis of transaction graphs within blockchain networks. While effective for identifying transaction patterns indicative of fraud, this analysis does not capture the semantics of transactions and is constrained to blockchain data. Consequently, preventive methods based on transaction graphs are inherently limited.

In response to these limitations, we propose the Kosmosis approach, which incrementally constructs a knowledge graph as new blockchain and social media data become available. During construction, Kosmosis extracts the semantics of transactions and connects blockchain addresses to their real-world entities by fusing both data sources in the knowledge graph. This enables novel preventive methods against rug pulls as a form of crypto asset fraud.

To demonstrate the effectiveness and practical applicability of the Kosmosis approach, we examine a series of real-world rug pulls. Through these cases, we illustrate how Kosmosis can aid in identifying and preventing such fraudulent activities by leveraging the insights from the constructed knowledge graph.
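The incremental-construction idea can be illustrated with a rough sketch. The entity names, edge labels, and the fusion query below are invented for this example and are not taken from the paper:

```python
# Sketch: a knowledge graph grown incrementally as blockchain and
# social media records arrive, fusing both sources so that a wallet
# address can be resolved to a real-world account.

from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        # adjacency list of triples: subject -> [(predicate, object), ...]
        self.edges = defaultdict(list)

    def add_fact(self, subj: str, pred: str, obj: str) -> None:
        """Incremental update: append a triple as new data streams in."""
        self.edges[subj].append((pred, obj))

    def objects(self, subj: str, pred: str) -> list[str]:
        return [o for p, o in self.edges[subj] if p == pred]

kg = KnowledgeGraph()
# blockchain stream: a token deployment observed on-chain
kg.add_fact("0xabc", "deployed", "MemeToken")
# social media stream: an account linked to the same wallet address
kg.add_fact("@promoter", "controls_wallet", "0xabc")

# Fusing both sources answers a question neither source answers alone:
# which social media account deployed MemeToken?
deployers = [
    acct
    for acct in kg.edges
    for wallet in kg.objects(acct, "controls_wallet")
    if "MemeToken" in kg.objects(wallet, "deployed")
]
```

The point of the fusion step is that the suspicious pattern (one social account behind a token launch) only becomes queryable once on-chain and off-chain facts live in the same graph.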

Stangl-The Kosmosis Use Case of Crypto Rug Pull Prevention-222_a.pdf


2:15pm - 2:40pm

Self-Refinement Strategies for LLM-based Product Attribute Value Extraction

Alexander Brinkmann, Christian Bizer

University of Mannheim, Germany

Structured product data, represented as attribute-value pairs, is crucial for e-commerce platforms to enable features such as faceted product search and attribute-based product comparison. However, vendors often supply unstructured product descriptions, necessitating attribute value extraction to ensure data consistency and usability.

Large language models (LLMs), including OpenAI's GPT-4o, have demonstrated their potential for product attribute value extraction in few-shot scenarios. Recent research has shown that self-refinement techniques can improve the performance of LLMs on tasks such as code generation and text-to-SQL translation. For other tasks, applying these techniques has only led to increased costs due to the processing of additional tokens, without achieving an improved performance.

This paper investigates applying two self-refinement techniques, error-based prompt rewriting and self-correction, to the product attribute value extraction task. The self-refinement techniques are evaluated across zero-shot, few-shot in-context learning, and fine-tuning scenarios. Experimental results reveal that both self-refinement techniques have only a marginal impact on the performance of GPT-4o across the different scenarios while significantly increasing processing costs. For attribute value extraction scenarios involving training data, fine-tuning yields the highest performance, and the ramp-up costs of fine-tuning are balanced out as the number of product descriptions grows.
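The self-correction loop evaluated here can be pictured as follows. The `call_llm` stub and the unit-validation rule are illustrative placeholders; the paper's actual prompts and GPT-4o calls are not shown:

```python
# Sketch of a self-correction loop for attribute value extraction:
# extract a value, validate it against a simple constraint, and
# re-prompt once if the first answer fails validation.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers."""
    if "Correct your previous answer" in prompt:
        return "16 GB"          # second attempt fixes the missing unit
    return "sixteen"            # first attempt lacks a unit

def is_valid(value: str) -> bool:
    """Toy constraint: a memory attribute value must carry a unit."""
    return value.endswith("GB")

def extract_with_self_correction(description: str, attribute: str) -> str:
    prompt = f"Extract the {attribute} from: {description}"
    value = call_llm(prompt)
    if not is_valid(value):
        # feed the failed answer back to the model for one refinement round
        prompt += (f"\nCorrect your previous answer '{value}': "
                   f"the value must include a unit.")
        value = call_llm(prompt)
    return value

result = extract_with_self_correction("Laptop with sixteen gigabytes RAM", "memory")
```

Each correction round costs an extra model call and extra tokens, which is exactly the cost-versus-quality trade-off the paper measures.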

Brinkmann-Self-Refinement Strategies for LLM-based Product Attribute Value Extraction-240_a.pdf


Conference: BTW 2025 Bamberg
Conference Software: ConfTool Pro 2.6.153+TC
© 2001–2025 by Dr. H. Weinreich, Hamburg, Germany