Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).

Session Overview
Session
R4: Research 4: Machine Learning 1
Time:
Thursday, 06/Mar/2025:
11:00am - 12:30pm

Session Chair: Fabian Panse, Universität Augsburg
Location: WE5/00.022

Lecture Hall 1

Presentations
11:00am - 11:20am

Graph-based QSS: A Graph-based Approach to Quantifying Semantic Similarity for Automated Linear SQL Grading

Leo Köberlein, Dominik Probst, Richard Lenz

Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany

Determining the Quantified Semantic Similarity (QSS) between database queries is a critical challenge with broad applications, from query log analysis to automated SQL skill assessment. Traditional methods often rely solely on syntactic comparisons or are limited to checking for semantic equivalence.

This paper introduces Graph-based QSS, a novel graph-based approach to measure the semantic dissimilarity between SQL queries. Queries are represented as nodes in an implicit graph, and the transitions between nodes, called edits, are weighted by semantic dissimilarity. We employ shortest path algorithms to identify the lowest-cost edit sequence between two given queries, thereby defining a quantifiable measure of semantic distance.
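The shortest-path formulation in the abstract can be illustrated with a minimal sketch: a toy, explicit edit graph whose nodes stand in for queries and whose edge weights stand in for semantic dissimilarity, searched with Dijkstra's algorithm. The graph, the query strings, and all weights here are invented for illustration and are not taken from the paper.

```python
import heapq

def cheapest_edit_cost(graph, start, goal):
    """Dijkstra's shortest path: the minimum total semantic
    dissimilarity over any edit sequence from start to goal."""
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, q = heapq.heappop(pq)
        if q == goal:
            return d
        if d > dist.get(q, float("inf")):
            continue
        for nxt, w in graph.get(q, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(pq, (nd, nxt))
    return float("inf")

# Toy edit graph: nodes are queries, edges are edits weighted by
# semantic dissimilarity -- all values purely illustrative.
edits = {
    "SELECT a FROM t": [("SELECT a FROM t WHERE x>0", 0.4),
                        ("SELECT b FROM t", 0.9)],
    "SELECT a FROM t WHERE x>0": [("SELECT b FROM t WHERE x>0", 0.3)],
    "SELECT b FROM t": [("SELECT b FROM t WHERE x>0", 0.4)],
}
cost = cheapest_edit_cost(edits, "SELECT a FROM t",
                          "SELECT b FROM t WHERE x>0")
```

In the paper's setting the graph is implicit (neighbors are generated by applying edit rules on demand), but the cost semantics are the same: the cheapest edit sequence defines the semantic distance.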

An empirical study of our prototype suggests that our method provides more accurate and comprehensible grading compared to existing techniques. Moreover, the results indicate that our approach comes close to the quality of manual grading, making it a robust tool for diverse database query comparison tasks.

Köberlein-Graph-based QSS-141_b.pdf
Köberlein-Graph-based QSS-141_c.pdf


11:20am - 11:40am

Enhancing In-Memory Spatial Indexing with Learned Search

Varun Pandey1, Alexander van Renen3, Eleni Tzirita Zacharatou4, Andreas Kipf3, Ibrahim Sabek5, Jialin Ding6, Volker Markl1,2, Alfons Kemper7

1Technische Universität Berlin; 2Berlin Institute for the Foundations of Learning and Data (BIFOLD); 3Technische Universität Nürnberg; 4IT University of Copenhagen; 5University of Southern California; 6Amazon Web Services; 7Technische Universität München

Spatial data is generated daily from numerous sources such as GPS-enabled devices, consumer applications (e.g., Uber, Strava), and social media (e.g., location-tagged posts). This exponential growth in spatial data is driving the development of efficient spatial data processing systems.

In this study, we enhance spatial indexing with a machine-learned search technique developed for single-dimensional sorted data. Specifically, we partition spatial data using six traditional spatial partitioning techniques and employ machine-learned search within each partition to support point, range, distance, and spatial join queries. By instance-optimizing each partitioning technique, we demonstrate that:

(i) grid-based index structures outperform tree-based ones (from 1.23x to 2.47x),

(ii) learning-enhanced spatial index structures are faster than their original counterparts (from 1.44x to 53.34x),

(iii) machine-learned search within a partition is 11.79% - 39.51% faster than binary search when filtering on one dimension,

(iv) the benefit of machine-learned search decreases in the presence of other compute-intensive operations (e.g. scan costs in higher selectivity queries, Haversine distance computation, and point-in-polygon tests), and

(v) index lookup is the bottleneck for tree-based structures, which could be mitigated by linearizing the indexed partitions.
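The core idea behind "machine-learned search within a partition" (finding iii) can be sketched as follows: fit a simple model that maps a key to its approximate position in the sorted array, then search only within the model's known error window instead of the whole array. The linear model and error-bound bookkeeping below are a simplified illustration under our own assumptions, not the authors' implementation.

```python
import bisect

def build_linear_model(keys):
    """Least-squares fit of position ~ slope*key + intercept over a
    sorted array, plus the maximum observed prediction error."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = sum((k - mean_k) * (i - mean_p)
                for i, k in enumerate(keys)) / var
    intercept = mean_p - slope * mean_k
    err = max(abs(i - (slope * k + intercept))
              for i, k in enumerate(keys))
    return slope, intercept, int(err) + 1

def learned_search(keys, model, key):
    """Predict a position, then binary-search only the error window."""
    slope, intercept, err = model
    guess = int(slope * key + intercept)
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1

keys = sorted(range(0, 1000, 7))        # sorted 1-D keys in a partition
model = build_linear_model(keys)
idx = learned_search(keys, model, 686)  # 686 = 7 * 98 -> position 98
```

Because the model narrows the search to a few positions, lookups can beat plain binary search on the full partition, which matches the direction of the abstract's reported 11.79% to 39.51% gains.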

Pandey-Enhancing In-Memory Spatial Indexing with Learned Search-137_b.pdf
Pandey-Enhancing In-Memory Spatial Indexing with Learned Search-137_c.pdf


11:40am - 12:00pm

Automated Configuration of Schema Matching Tools: A Reinforcement Learning Approach

Michał Patryk Miazga1, Daniel Abitz1,2, Matthias Täschner1, Erhard Rahm1,3

1Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Universität Leipzig, Germany; 2University Computing Centre, Research and Development, Leipzig University, 04109 Leipzig, Dittrichring 18-20, Germany; 3Department of Computer Science, Leipzig University, Leipzig, Germany

Schema matching involves identifying matching relationships between different data schemas. While this task is supported by semi-automatic tools, achieving optimal results requires configuring such tools, which can be challenging depending on the number of configuration options and schema characteristics. This study proposes a novel approach utilizing Reinforcement Learning (RL) to automate the configuration of schema matching tools. RL has proven to be well-suited for complex optimization problems but has not yet been applied to schema matching. We outline how the configuration of a schema matching tool can be tackled as an RL task and how the corresponding learning process can be accelerated and optimized. We evaluate the RL approach for a large real-world dataset and show that it can be applied to different matching tools.
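To give a feel for casting tool configuration as a learning problem, here is a deliberately simplified sketch: an epsilon-greedy value-learning loop (a stateless special case of RL) that searches over a discretized similarity threshold, with a synthetic stand-in for running the matcher and scoring its output. The thresholds, the reward curve, and the whole setup are hypothetical and not taken from the paper.

```python
import random

random.seed(0)

# Hypothetical discretized similarity thresholds for a matching tool.
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]

def matcher_f1(threshold):
    """Stand-in for running the matching tool with this configuration
    and scoring against a gold standard -- a synthetic curve here,
    peaking at 0.7."""
    return 1.0 - 4 * (threshold - 0.7) ** 2

# Epsilon-greedy action-value learning over configurations.
values = {t: 0.0 for t in thresholds}
counts = {t: 0 for t in thresholds}
for step in range(500):
    if random.random() < 0.1:               # explore
        t = random.choice(thresholds)
    else:                                   # exploit best so far
        t = max(values, key=values.get)
    r = matcher_f1(t)
    counts[t] += 1
    values[t] += (r - values[t]) / counts[t]  # incremental mean

best = max(values, key=values.get)
```

A full RL formulation as described in the abstract would additionally carry state (the current configuration) and actions that modify individual options, but the explore/exploit trade-off driving the search is the same.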

Miazga-Automated Configuration of Schema Matching Tools-124_b.pdf


12:00pm - 12:20pm

Scalable Computation of Shapley Additive Explanations

Louis Le Page, Christina Dionysio, Matthias Boehm

Technische Universität Berlin, Germany

The growing field of explainable AI (XAI) develops methods that help better understand ML model predictions. While SHapley Additive exPlanations (SHAP) is a widely used, model-agnostic method for explaining predictions, its use comes with a significant computational burden, particularly for complex models and large datasets with many features. The key, and so far unaddressed, challenge lies in efficiently scaling these computations without compromising accuracy. In this paper, we present a scalable, model-agnostic SHAP sampling framework on top of Apache SystemDS. We leverage Antithetic Permutation Sampling for its efficiency and optimization potential, and we devise a carefully vectorized and parallelized implementation for local and distributed operations. Compared with the state-of-the-art Python shap package, our solution yields similar accuracy but achieves significant speedups of up to 14x for multi-threaded single-node operations as well as up to 35x for distributed Spark operations (on a small 8-node cluster).
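Antithetic permutation sampling, the estimator the abstract builds on, can be illustrated in a few lines: Shapley values are estimated from marginal contributions along random feature orderings, and each sampled permutation is paired with its reverse to reduce variance. This is a purely illustrative sketch, not the SystemDS implementation; the toy model and background values are invented.

```python
import random

random.seed(0)

def shap_antithetic(f, x, background, n_pairs=200):
    """Permutation-sampling estimate of Shapley values, using
    antithetic pairs (each permutation together with its reverse)."""
    d = len(x)
    phi = [0.0] * d
    for _ in range(n_pairs):
        perm = list(range(d))
        random.shuffle(perm)
        for order in (perm, perm[::-1]):   # antithetic pair
            z = list(background)
            prev = f(z)
            for j in order:                # switch features on one by one
                z[j] = x[j]
                cur = f(z)
                phi[j] += cur - prev       # marginal contribution of j
                prev = cur
    return [p / (2 * n_pairs) for p in phi]

# Additive toy model: the exact Shapley values are w_j * (x_j - b_j),
# so the estimate can be checked against a known answer.
w = [1.0, -2.0, 0.5]
f = lambda v: sum(wi * vi for wi, vi in zip(w, v))
x, b = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
phi = shap_antithetic(f, x, b)
```

For a linear model every permutation yields the exact contributions, so the estimate matches [1.0, -2.0, 0.5]; for nonlinear models the antithetic pairing is what keeps the sampling variance down at a given budget.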

Le Page-Scalable Computation of Shapley Additive Explanations-142_b.pdf
Le Page-Scalable Computation of Shapley Additive Explanations-142_c.zip


12:20pm - 12:30pm

Extending K-Means Clustering with Ptolemy’s Inequality

Max Pernklau, Nikita Averitchev, Christian Beecks

FernUniversität in Hagen, Germany

Clustering is a fundamental data analytics operation in the field of unsupervised learning. Given a database of unknown structure, clustering aims to discover the inherent structure of the data objects according to similarity, such that similar data objects are grouped together, while dissimilar ones are separated into different groups or clusters.

In this study, we focus on partitioning methods, specifically the k-means clustering approach, which minimizes the intra-cluster variance.

As the standard algorithmic approach to k-means clustering, namely the Lloyd algorithm, is neither efficient nor scalable, various adaptations and modifications have been developed, resulting in the family of fast k-means clustering algorithms. In this short paper, we extend the clustering algorithm proposed by Elkan with Ptolemy's inequality to prune superfluous distance calculations. It is not our intention with this paper to compete with the state-of-the-art algorithmic solutions for the k-means clustering problem; rather, we seek to investigate the potential of Ptolemy's inequality to further enhance the standard algorithm's performance, particularly in terms of reducing computations associated with distance evaluations.
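The pruning idea can be sketched from the inequality itself. For any four points, Ptolemy's inequality d(a,c)·d(b,d) ≤ d(a,b)·d(c,d) + d(b,c)·d(a,d) rearranges into a lower bound on the distance between a point x and a centroid c using two pivot points p and q whose distances are already known; if that bound already exceeds the distance to the currently best centroid, d(x,c) never needs to be computed. The pivot choice and the toy points below are our own illustration, not the paper's construction.

```python
import math

def dist(a, b):
    """Euclidean distance (Ptolemy's inequality holds in this metric)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def ptolemaic_lower_bound(d_xp, d_xq, d_cp, d_cq, d_pq):
    """Lower bound on d(x, c) from Ptolemy's inequality, using two
    pivots p and q with precomputed distances. Both rearrangements
    (swapping the pivot roles) are valid, so take the larger."""
    return max(d_xq * d_cp - d_xp * d_cq,
               d_xp * d_cq - d_xq * d_cp, 0.0) / d_pq

# Toy check: the bound never exceeds the true distance.
x, c = (0.0, 0.0), (10.0, 0.0)
p, q = (3.0, 4.0), (7.0, -1.0)        # arbitrary pivots
lb = ptolemaic_lower_bound(dist(x, p), dist(x, q),
                           dist(c, p), dist(c, q), dist(p, q))
assert lb <= dist(x, c)
```

In an Elkan-style assignment step, such bounds are cheap because the pivot-to-centroid and pivot-to-point distances are cached; a candidate centroid is skipped whenever its lower bound is no better than the best distance found so far.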

Pernklau-Extending K-Means Clustering with Ptolemy’s Inequality-166_b.pdf
Pernklau-Extending K-Means Clustering with Ptolemy’s Inequality-166_c.zip