Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).

Session Overview
Session
R4: Research 4: Machine Learning 1
Time:
Thursday, 06/Mar/2025:
11:00am - 12:30pm

Session Chair: Fabian Panse, Universität Augsburg
Location: WE5/00.022

Lecture Hall 1

Presentations
11:00am - 11:20am

Graph-based QSS: A Graph-based Approach to Quantifying Semantic Similarity for Automated Linear SQL Grading

Leo Köberlein, Dominik Probst, Richard Lenz

Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Germany

Determining the Quantified Semantic Similarity (QSS) between database queries is a critical challenge with broad applications, from query log analysis to automated SQL skill assessment. Traditional methods often rely solely on syntactic comparisons or are limited to checking for semantic equivalence.

This paper introduces Graph-based QSS, a novel graph-based approach to measure the semantic dissimilarity between SQL queries. Queries are represented as nodes in an implicit graph, and the transitions between nodes, called edits, are weighted by semantic dissimilarity. We employ shortest path algorithms to identify the lowest-cost edit sequence between two given queries, thereby defining a quantifiable measure of semantic distance.
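The shortest-path formulation in the abstract can be illustrated with a minimal sketch: a toy, explicit edit graph whose nodes stand in for queries and whose edge weights stand in for semantic dissimilarity, searched with Dijkstra's algorithm. The graph, the query strings, and all weights here are invented for illustration and are not taken from the paper.

```python
import heapq

def cheapest_edit_cost(graph, start, goal):
    """Dijkstra's shortest path: the minimum total semantic
    dissimilarity over any edit sequence from start to goal."""
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, q = heapq.heappop(pq)
        if q == goal:
            return d
        if d > dist.get(q, float("inf")):
            continue
        for nxt, w in graph.get(q, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(pq, (nd, nxt))
    return float("inf")

# Toy edit graph: nodes are queries, edges are edits weighted by
# semantic dissimilarity -- all values purely illustrative.
edits = {
    "SELECT a FROM t": [("SELECT a FROM t WHERE x>0", 0.4),
                        ("SELECT b FROM t", 0.9)],
    "SELECT a FROM t WHERE x>0": [("SELECT b FROM t WHERE x>0", 0.3)],
    "SELECT b FROM t": [("SELECT b FROM t WHERE x>0", 0.4)],
}
cost = cheapest_edit_cost(edits, "SELECT a FROM t",
                          "SELECT b FROM t WHERE x>0")
```

In the paper's setting the graph is implicit (neighbors are generated by applying edit rules on demand), but the cost semantics are the same: the cheapest edit sequence defines the semantic distance.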

An empirical study of our prototype suggests that our method provides more accurate and comprehensible grading compared to existing techniques. Moreover, the results indicate that our approach comes close to the quality of manual grading, making it a robust tool for diverse database query comparison tasks.

Köberlein-Graph-based QSS-141_b.pdf
Köberlein-Graph-based QSS-141_c.pdf


11:20am - 11:40am

Enhancing In-Memory Spatial Indexing with Learned Search

Varun Pandey1, Alexander van Renen3, Eleni Tzirita Zacharatou4, Andreas Kipf3, Ibrahim Sabek5, Jialin Ding6, Volker Markl1,2, Alfons Kemper7

1Technische Universität Berlin; 2Berlin Institute for the Foundations of Learning and Data (BIFOLD); 3Technische Universität Nürnberg; 4IT University of Copenhagen; 5University of Southern California; 6Amazon Web Services; 7Technische Universität München

Spatial data is generated daily from numerous sources such as GPS-enabled devices, consumer applications (e.g., Uber, Strava), and social media (e.g., location-tagged posts). This exponential growth in spatial data is driving the development of efficient spatial data processing systems.

In this study, we enhance spatial indexing with a machine-learned search technique developed for single-dimensional sorted data. Specifically, we partition spatial data using six traditional spatial partitioning techniques and employ machine-learned search within each partition to support point, range, distance, and spatial join queries. By instance-optimizing each partitioning technique, we demonstrate that:

(i) grid-based index structures outperform tree-based ones (from 1.23x to 2.47x),

(ii) learning-enhanced spatial index structures are faster than their original counterparts (from 1.44x to 53.34x),

(iii) machine-learned search within a partition is 11.79% - 39.51% faster than binary search when filtering on one dimension,

(iv) the benefit of machine-learned search decreases in the presence of other compute-intensive operations (e.g. scan costs in higher selectivity queries, Haversine distance computation, and point-in-polygon tests), and

(v) index lookup is the bottleneck for tree-based structures, which could be mitigated by linearizing the indexed partitions.
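The core idea behind "machine-learned search within a partition" (finding iii) can be sketched as follows: fit a simple model that maps a key to its approximate position in the sorted array, then search only within the model's known error window instead of the whole array. The linear model and error-bound bookkeeping below are a simplified illustration under our own assumptions, not the authors' implementation.

```python
import bisect

def build_linear_model(keys):
    """Least-squares fit of position ~ slope*key + intercept over a
    sorted array, plus the maximum observed prediction error."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = sum((k - mean_k) * (i - mean_p)
                for i, k in enumerate(keys)) / var
    intercept = mean_p - slope * mean_k
    err = max(abs(i - (slope * k + intercept))
              for i, k in enumerate(keys))
    return slope, intercept, int(err) + 1

def learned_search(keys, model, key):
    """Predict a position, then binary-search only the error window."""
    slope, intercept, err = model
    guess = int(slope * key + intercept)
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1

keys = sorted(range(0, 1000, 7))        # sorted 1-D keys in a partition
model = build_linear_model(keys)
idx = learned_search(keys, model, 686)  # 686 = 7 * 98 -> position 98
```

Because the model narrows the search to a few positions, lookups can beat plain binary search on the full partition, which matches the direction of the abstract's reported 11.79% to 39.51% gains.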

Pandey-Enhancing In-Memory Spatial Indexing with Learned Search-137_b.pdf
Pandey-Enhancing In-Memory Spatial Indexing with Learned Search-137_c.pdf


11:40am - 12:00pm

Automated Configuration of Schema Matching Tools: A Reinforcement Learning Approach

Michał Patryk Miazga1, Daniel Abitz1,2, Matthias Täschner1, Erhard Rahm1,3

1Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Universität Leipzig, Germany; 2University Computing Centre, Research and Development, Leipzig University, 04109 Leipzig, Dittrichring 18-20, Germany; 3Department of Computer Science, Leipzig University, Leipzig, Germany

Schema matching involves identifying matching relationships between different data schemas. While this task is supported by semi-automatic tools, achieving optimal results requires configuring such tools, which can be challenging depending on the number of configuration options and schema characteristics. This study proposes a novel approach utilizing Reinforcement Learning (RL) to automate the configuration of schema matching tools. RL has proven to be well-suited for complex optimization problems but has not yet been applied to schema matching. We outline how the configuration of a schema matching tool can be tackled as an RL task and how the corresponding learning process can be accelerated and optimized. We evaluate the RL approach for a large real-world dataset and show that it can be applied to different matching tools.
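To give a feel for casting tool configuration as a learning problem, here is a deliberately simplified sketch: an epsilon-greedy value-learning loop (a stateless special case of RL) that searches over a discretized similarity threshold, with a synthetic stand-in for running the matcher and scoring its output. The thresholds, the reward curve, and the whole setup are hypothetical and not taken from the paper.

```python
import random

random.seed(0)

# Hypothetical discretized similarity thresholds for a matching tool.
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]

def matcher_f1(threshold):
    """Stand-in for running the matching tool with this configuration
    and scoring against a gold standard -- a synthetic curve here,
    peaking at 0.7."""
    return 1.0 - 4 * (threshold - 0.7) ** 2

# Epsilon-greedy action-value learning over configurations.
values = {t: 0.0 for t in thresholds}
counts = {t: 0 for t in thresholds}
for step in range(500):
    if random.random() < 0.1:               # explore
        t = random.choice(thresholds)
    else:                                   # exploit best so far
        t = max(values, key=values.get)
    r = matcher_f1(t)
    counts[t] += 1
    values[t] += (r - values[t]) / counts[t]  # incremental mean

best = max(values, key=values.get)
```

A full RL formulation as described in the abstract would additionally carry state (the current configuration) and actions that modify individual options, but the explore/exploit trade-off driving the search is the same.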

Miazga-Automated Configuration of Schema Matching Tools-124_b.pdf


12:00pm - 12:20pm

Scalable Computation of Shapley Additive Explanations

Louis Le Page, Christina Dionysio, Matthias Boehm

Technische Universität Berlin, Germany

The growing field of explainable AI (XAI) develops methods that help better understand ML model predictions. While SHapley Additive exPlanations (SHAP) is a widely used, model-agnostic method for explaining predictions, its use comes with a significant computational burden, particularly for complex models and large datasets with many features. The key, and so far unaddressed, challenge lies in efficiently scaling these computations without compromising accuracy. In this paper, we present a scalable, model-agnostic SHAP sampling framework on top of Apache SystemDS. We leverage Antithetic Permutation Sampling for its efficiency and optimization potential, and we devise a carefully vectorized and parallelized implementation for local and distributed operations. Compared with the state-of-the-art Python shap package, our solution yields similar accuracy but achieves significant speedups of up to 14x for multi-threaded single-node operations as well as up to 35x for distributed Spark operations (on a small 8-node cluster).
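Antithetic permutation sampling, the estimator the abstract builds on, can be illustrated in a few lines: Shapley values are estimated from marginal contributions along random feature orderings, and each sampled permutation is paired with its reverse to reduce variance. This is a purely illustrative sketch, not the SystemDS implementation; the toy model and background values are invented.

```python
import random

random.seed(0)

def shap_antithetic(f, x, background, n_pairs=200):
    """Permutation-sampling estimate of Shapley values, using
    antithetic pairs (each permutation together with its reverse)."""
    d = len(x)
    phi = [0.0] * d
    for _ in range(n_pairs):
        perm = list(range(d))
        random.shuffle(perm)
        for order in (perm, perm[::-1]):   # antithetic pair
            z = list(background)
            prev = f(z)
            for j in order:                # switch features on one by one
                z[j] = x[j]
                cur = f(z)
                phi[j] += cur - prev       # marginal contribution of j
                prev = cur
    return [p / (2 * n_pairs) for p in phi]

# Additive toy model: the exact Shapley values are w_j * (x_j - b_j),
# so the estimate can be checked against a known answer.
w = [1.0, -2.0, 0.5]
f = lambda v: sum(wi * vi for wi, vi in zip(w, v))
x, b = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
phi = shap_antithetic(f, x, b)
```

For a linear model every permutation yields the exact contributions, so the estimate matches [1.0, -2.0, 0.5]; for nonlinear models the antithetic pairing is what keeps the sampling variance down at a given budget.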

Le Page-Scalable Computation of Shapley Additive Explanations-142_b.pdf
Le Page-Scalable Computation of Shapley Additive Explanations-142_c.zip


12:20pm - 12:30pm

Extending K-Means Clustering with Ptolemy’s Inequality

Max Pernklau, Nikita Averitchev, Christian Beecks

FernUniversität in Hagen, Germany

Clustering is a fundamental data analytics operation in the field of unsupervised learning. Given a database of unknown structure, clustering aims to discover the inherent structure of the data objects according to similarity, such that similar data objects are grouped together, while dissimilar ones are separated into different groups or clusters.

In this study, we focus on partitioning methods, specifically the k-means clustering approach, which minimizes the intra-cluster variance.

As the standard algorithmic approach to k-means clustering, namely the Lloyd algorithm, is neither efficient nor scalable, various adaptations and modifications have been developed, resulting in the family of fast k-means clustering algorithms. In this short paper, we extend the clustering algorithm proposed by Elkan with Ptolemy's inequality to prune superfluous distance calculations. It is not our intention with this paper to compete with the state-of-the-art algorithmic solutions for the k-means clustering problem; rather, we seek to investigate the potential of Ptolemy's inequality to further enhance the standard algorithm's performance, particularly in terms of reducing computations associated with distance evaluations.
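The pruning idea can be sketched from the inequality itself. For any four points, Ptolemy's inequality d(a,c)·d(b,d) ≤ d(a,b)·d(c,d) + d(b,c)·d(a,d) rearranges into a lower bound on the distance between a point x and a centroid c using two pivot points p and q whose distances are already known; if that bound already exceeds the distance to the currently best centroid, d(x,c) never needs to be computed. The pivot choice and the toy points below are our own illustration, not the paper's construction.

```python
import math

def dist(a, b):
    """Euclidean distance (Ptolemy's inequality holds in this metric)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def ptolemaic_lower_bound(d_xp, d_xq, d_cp, d_cq, d_pq):
    """Lower bound on d(x, c) from Ptolemy's inequality, using two
    pivots p and q with precomputed distances. Both rearrangements
    (swapping the pivot roles) are valid, so take the larger."""
    return max(d_xq * d_cp - d_xp * d_cq,
               d_xp * d_cq - d_xq * d_cp, 0.0) / d_pq

# Toy check: the bound never exceeds the true distance.
x, c = (0.0, 0.0), (10.0, 0.0)
p, q = (3.0, 4.0), (7.0, -1.0)        # arbitrary pivots
lb = ptolemaic_lower_bound(dist(x, p), dist(x, q),
                           dist(c, p), dist(c, q), dist(p, q))
assert lb <= dist(x, c)
```

In an Elkan-style assignment step, such bounds are cheap because the pivot-to-centroid and pivot-to-point distances are cached; a candidate centroid is skipped whenever its lower bound is no better than the best distance found so far.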

Pernklau-Extending K-Means Clustering with Ptolemy’s Inequality-166_b.pdf
Pernklau-Extending K-Means Clustering with Ptolemy’s Inequality-166_c.zip