Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only the sessions on that day or at that location. Select a single session for a detailed view (with abstracts and downloads, if available).

 
 
Session Overview
Session
R7: Research 7: Machine Learning 2
Time:
Friday, 07/Mar/2025:
9:00am - 9:40am

Session Chair: Anika Groß, Hochschule Anhalt
Location: WE5/00.022

Lecture Hall 1

Presentations
9:00am - 9:20am

Incremental SliceLine for Iterative ML Model Debugging under Updates

Frederic Caspar Zoepffel, Christina Dionysio, Matthias Boehm

TU Berlin + BIFOLD, Germany

SliceLine is a model debugging technique for finding the top-k worst slices (conjunctions of attribute values) where a trained machine learning (ML) model performs significantly worse than on average. In contrast to other slice-finding techniques, SliceLine introduced an intuitive scoring function, effective pruning strategies, and fast linear-algebra-based evaluation strategies. Together, these allow SliceLine to find the exact top-k worst slices in the full lattice of possible conjunctions in reasonable time. Recently, we have observed an increasing trend toward iterative algorithms that incrementally update the dataset (e.g., selecting samples, augmenting with new instances). Fully recomputing SliceLine from scratch for every update is unnecessarily wasteful. In this paper, we introduce an incremental problem formulation of SliceLine, new pruning strategies that leverage intermediates of previous slice-finding runs on a modified dataset, and an extended linear-algebra-based enumeration algorithm. Our experiments show that incremental SliceLine yields robust performance improvements, up to an order of magnitude faster than full SliceLine, while still allowing effective parallelization in local, distributed, and federated environments.

Zoepffel-Incremental SliceLine for Iterative ML Model Debugging under Updates-111_b.pdf
Zoepffel-Incremental SliceLine for Iterative ML Model Debugging under Updates-111_c.zip
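For readers unfamiliar with the scoring function the abstract mentions: the published SliceLine work scores a slice by how much its average error exceeds the overall average, penalized by slice size. The sketch below illustrates that idea; it is a minimal illustration, not the paper's implementation, and the weighting parameter `alpha` and variable names are assumptions.

```python
import numpy as np

def slice_score(slice_errors, all_errors, alpha=0.95):
    """Illustrative SliceLine-style slice score:
        sc = alpha * (avg slice error / avg overall error - 1)
             - (1 - alpha) * (n / |S| - 1)
    Positive scores flag slices where the model performs worse than
    average; the size term discourages tiny, uninformative slices.
    """
    n, s = len(all_errors), len(slice_errors)
    avg_slice = np.mean(slice_errors)
    avg_all = np.mean(all_errors)
    return alpha * (avg_slice / avg_all - 1) - (1 - alpha) * (n / s - 1)

# Toy example: per-row model errors, with two high-error rows.
errors = np.array([0.1] * 8 + [0.9, 0.9])
bad_slice = errors[8:]    # the two high-error rows
good_slice = errors[:2]   # two low-error rows
```

A slice whose average error is above the dataset average receives a positive score and would be ranked among the top-k worst slices; low-error slices score negative.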


9:20am - 9:40am

Assessing the Impact of Image Dataset Features on Privacy-Preserving Machine Learning

Lucas Lange, Maurice-Maximilian Heykeroth, Erhard Rahm

Leipzig University & ScaDS.AI Dresden/Leipzig, Germany

Machine Learning (ML) is crucial in many domains, including computer vision. However, ML models trained on sensitive data face security challenges, as they can be attacked and leak information. Privacy-Preserving Machine Learning (PPML) addresses this by using Differential Privacy (DP) to balance utility and privacy. This study identifies image dataset characteristics that affect the utility and vulnerability of private and non-private Convolutional Neural Network (CNN) models. By analyzing multiple datasets and privacy budgets, we find that imbalanced datasets increase vulnerability in minority classes, but DP mitigates this issue. Datasets with fewer classes improve both model utility and privacy, while high entropy or low Fisher Discriminant Ratio (FDR) datasets deteriorate the utility-privacy trade-off. These insights offer valuable guidance for practitioners and researchers in estimating and optimizing the utility-privacy trade-off in image datasets, helping to inform data and privacy modifications for better outcomes based on dataset characteristics.

Lange-Assessing the Impact of Image Dataset Features on Privacy-Preserving Machine-112_b.pdf
Lange-Assessing the Impact of Image Dataset Features on Privacy-Preserving Machine-112_c.zip
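The abstract characterizes datasets by label entropy and the Fisher Discriminant Ratio (FDR). The sketch below shows the textbook definitions of both statistics as a rough guide; the paper's exact computation (e.g., how FDR is aggregated over image features) may differ, and these function names are assumptions.

```python
import numpy as np

def label_entropy(labels):
    """Shannon entropy of the class-label distribution, in bits.
    Higher entropy means a more uniform (or larger) set of classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def fisher_discriminant_ratio(x, labels):
    """Textbook one-feature FDR, summed over class pairs:
        sum_{i<j} (mu_i - mu_j)^2 / (var_i + var_j)
    Higher FDR means classes separate more easily on this feature."""
    classes = np.unique(labels)
    fdr = 0.0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            xi = x[labels == classes[i]]
            xj = x[labels == classes[j]]
            fdr += (xi.mean() - xj.mean()) ** 2 / (xi.var() + xj.var())
    return float(fdr)

# Toy example: a balanced two-class problem with well-separated means.
x = np.array([0.0, 1.0, 10.0, 11.0])
y = np.array([0, 0, 1, 1])
```

On this toy data the entropy is 1 bit (two balanced classes) and the FDR is large, reflecting the study's finding that low-FDR (hard-to-separate) datasets worsen the utility-privacy trade-off.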