ACloudDM 2: Workshop on Advances in Cloud Data Management 2
Time: Tuesday, 04/Mar/2025, 11:00am - 12:30pm
Location: WE5/00.019, Lecture Hall 2
Session Abstract
11:00 – 11:20 Fabian Hueske, Confluent
11:20 – 11:50 Andreas Kipf, TU Nürnberg
11:50 – 12:10 Tomas Karnagel, Observe
12:10 – 12:30 Panos Parchas, AWS Redshift
Presentations
Preparing Data for Analytics: Exploring Modern Approaches to Data Pipelines
Fabian Hüske
Confluent, Germany
Cloud data warehouses and data lakes power the analytical workloads of many enterprises. These systems store vast amounts of data, generated by external sources, that must be ingested before they are ready for querying. The ingested data typically requires cleaning, transformation, enrichment, integration, and aggregation to ensure it is in the right format for effective analysis.
Given the scale of data being processed, the transformation engines responsible for these tasks must offer high throughput while maintaining cost efficiency. Furthermore, low-latency processing is important for meeting the demands of many real-time use cases.
Different architectures exist for implementing data pipelines that perform these transformations. Some, like Snowflake's Dynamic Tables, rely on periodic batch processing, while others leverage stateful stream processing engines such as Apache Flink. In this talk, we discuss different data pipeline architectures and analyze their strengths, limitations, and trade-offs.
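The contrast between the two pipeline styles can be illustrated with a toy per-key sum. This is a minimal sketch, not a description of how Snowflake or Apache Flink work internally: the names `batch_totals` and `StreamingTotals` are hypothetical, the batch variant simply re-aggregates all input on each run, and the streaming variant keeps state in memory and updates it per event.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]  # (key, amount); an assumed, illustrative event shape


def batch_totals(all_events: Iterable[Event]) -> Dict[str, int]:
    """Periodic batch style: each run aggregates the full input it is given."""
    totals: Dict[str, int] = defaultdict(int)
    for key, amount in all_events:
        totals[key] += amount
    return dict(totals)


class StreamingTotals:
    """Stateful streaming style: keep a running total and update it per event."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = defaultdict(int)

    def on_event(self, event: Event) -> Tuple[str, int]:
        key, amount = event
        self.state[key] += amount
        # Emit the updated total for this key as soon as the event arrives.
        return key, self.state[key]


events = [("a", 1), ("b", 2), ("a", 3)]
stream = StreamingTotals()
updates = [stream.on_event(e) for e in events]
# Both styles agree on the final totals; they differ in when results appear.
print(batch_totals(events), updates)
```

In this sketch the batch result is only as fresh as the last run, while the streaming variant produces an updated total after every event at the cost of holding state between events.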
Workload-Driven Indexing in the Cloud
Andreas Kipf
University of Technology Nuremberg, Germany
In this talk, I will present predicate caching, a lightweight secondary indexing mechanism for cloud data warehouses. Specifically, I will show that workloads are highly repetitive, i.e., users and systems frequently send the same queries. To improve query performance on such workloads, most systems rely on techniques like result caching or materialized views. However, these caches are often stale due to inserts, deletes, or updates that occur between query repetitions. Predicate caching, on the other hand, improves query latency for repeating scans and joins in a lightweight manner, by simply storing ranges of qualifying tuples. Such an index can be built on the fly and can be kept online without recomputation. We implemented a prototype of this idea in the cloud data warehouse Amazon Redshift. Our evaluation shows that predicate caching improves query runtimes by up to 10x on selected queries with negligible build overhead.
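The idea of storing ranges of qualifying tuples can be sketched as follows. This is a toy illustration under stated assumptions, not the Amazon Redshift implementation: the class `PredicateCache`, its `scan` method, and the string key used to identify a predicate are hypothetical, and the table is an append-only in-memory list, so deletes and updates are not handled.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]
Range = Tuple[int, int]  # half-open [start, end) interval of row positions


class PredicateCache:
    """Toy predicate cache over an append-only, in-memory table (assumption)."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        # predicate key -> (qualifying ranges, number of rows already scanned)
        self.cache: Dict[str, Tuple[List[Range], int]] = {}

    def scan(self, key: str, pred: Callable[[Row], bool]) -> List[Row]:
        ranges, scanned = self.cache.get(key, ([], 0))
        ranges = list(ranges)
        # Only rows appended since the last scan are evaluated again.
        for pos in range(scanned, len(self.rows)):
            if pred(self.rows[pos]):
                if ranges and ranges[-1][1] == pos:
                    # Adjacent to the last range: extend it.
                    ranges[-1] = (ranges[-1][0], pos + 1)
                else:
                    ranges.append((pos, pos + 1))
        self.cache[key] = (ranges, len(self.rows))
        # Answer the scan by reading only the cached qualifying ranges.
        return [self.rows[p] for start, end in ranges for p in range(start, end)]


table = [{"x": v} for v in [1, 7, 9, 2, 8]]
pc = PredicateCache(table)
first = pc.scan("x > 5", lambda r: r["x"] > 5)   # builds the entry on the fly
table.append({"x": 10})                          # new data arrives
second = pc.scan("x > 5", lambda r: r["x"] > 5)  # evaluates only the new row
print(first, second)
```

Unlike a result cache, the entry here is not discarded when new rows arrive: the repeated scan reuses the stored ranges and evaluates the predicate only on the appended row.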
OBSERVE - Petascale Streaming for Observability
Tomas Karnagel
Observe, Switzerland
Observe brings together petabyte-per-day streaming ingest, relational analytics, search, and real-time monitoring capabilities under one product. Observe was built to enable all types of observability workloads (logs, metrics, traces, application performance, and security) as well as complex business data analytics, over a single connected data lake. The platform is powered by the Snowflake Data Cloud, which supports our hundreds of millions of queries per day. In this talk, we will give an overview of the architecture and capabilities of the Observe platform.
Query acceleration via auto-tuning in Amazon Redshift
Panos Parchas
AWS, Germany
Amazon Redshift is the first fully managed, petabyte-scale, enterprise-grade cloud data warehouse that revolutionized the data warehousing industry. Over the last decade, the Redshift team has constantly innovated by extending the functionality and improving the efficiency of the system. A large focus area has been "ease of use", which relies on auto-tuning and ML techniques to make the system more performant for the unique characteristics of individual workloads. This talk provides an overview of Redshift's architecture, focusing on query processing. Within this context, we discuss techniques that the team has developed during the past couple of years for query acceleration, and we dive deep into our novel data distribution and data layout schemes.