What question did this study set out to answer?

The study asks how to automatically choose table layout parameters in lakehouse systems, given that the best layout depends on the query workload.

April 10, 2026 · Open Access

PTO: A Workload-driven Predictive Table Optimizer for Lakehouse Systems

Key Points

Implemented a predictive table optimizer (PTO) on the Presto lakehouse engine for Apache Iceberg tables.
Optimized layout parameters such as partitioning columns and target file sizes so that table scans skip more irrelevant data.
Evaluated performance on the TPC-H and TPC-DS benchmarks at SF 10K.
Reduced average workload latency by 11% on TPC-H.
Reduced average workload latency by 36% on TPC-DS.
Sped up scan-intensive, long-latency queries by 3.4× on TPC-H and 11× on TPC-DS.
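The results above mix two units: percentage latency reductions and speedup factors. As a rough aid for comparing them, the sketch below converts between the two, assuming the usual definitions (speedup = old latency / new latency, reduction = 1 - new latency / old latency). The function names are illustrative and do not come from the paper. The averages and the speedups also describe different things: the whole workload versus individual scan-intensive queries.

```python
def reduction_to_speedup(reduction: float) -> float:
    """Convert a fractional latency reduction (0.11 for 11%) to a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def speedup_to_reduction(speedup: float) -> float:
    """Convert a speedup factor (3.4 for 3.4x) to a fractional latency reduction."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Average workload latency reductions reported on the page.
    for name, reduction in [("TPC-H", 0.11), ("TPC-DS", 0.36)]:
        print(f"{name}: {reduction:.0%} reduction ~ {reduction_to_speedup(reduction):.2f}x speedup")
    # Speedups reported for scan-intensive, long-latency queries.
    for name, speedup in [("TPC-H", 3.4), ("TPC-DS", 11.0)]:
        print(f"{name}: {speedup}x speedup ~ {speedup_to_reduction(speedup):.1%} reduction")
```

Under these definitions, a 36% average reduction corresponds to roughly a 1.56× workload-level speedup, while an 11× speedup on a single query corresponds to about a 91% reduction in that query's latency.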

Abstract

Data lakehouse architectures manage both structured and semi-structured data, often on disaggregated storage, with data volumes that can reach petabyte scale stored in open table formats such as Apache Iceberg. Because of this size and storage structure, traditional indexes are cumbersome to maintain, so effective table organization is needed to retrieve the data relevant to analytical queries efficiently. To maximize skipping of irrelevant data while scanning large tables, lakehouse systems rewrite data files according to pre-specified partitioning columns, target file sizes, row group sizes, and bin-packing or sort strategies. Optimizing these parameters can enhance the skipping of irrelevant data during table scans and significantly improve query performance. State-of-the-art lakehouse systems often require the user to specify these parameters manually, which is impractical given the combinatorial search space of parameter values and severely impedes the usability of the existing table optimization features in these systems. An exhaustive search for the best combination is also impractical, because the parameters are interdependent and rewriting a table with a single instantiation of all four parameters can already take several hours at terabyte scale. An additional complexity is that the optimal parameter settings are sensitive to the query workload, since the filter predicates associated with the scan operators in the workload determine the skipping benefits a data layout can provide. While our solution is applicable to lakehouse systems and open table formats that adopt similar parameterized layouts, we implemented PTO on the Presto lakehouse engine to optimize Apache Iceberg tables.
Our experiments show that PTO reduces the average workload latency by 11% on TPC-H and 36% on TPC-DS benchmarks at SF 10K, while speeding up scan-intensive, long-latency queries by 3.4× and 11×, respectively.
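To make the notion of data skipping concrete, here is a minimal, hypothetical sketch in Python. It is not PTO's algorithm and does not use any Iceberg or Presto API. It models each data file only by the min/max of one column and counts how many files a range predicate has to read under two toy layouts: rows written in arrival order, and rows sorted on the filtered column before being cut into files. All names (`FileStats`, `write_files`, `files_to_scan`) and the synthetic data are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FileStats:
    """Per-file min/max of one column, a stand-in for file-level statistics."""
    min_value: int
    max_value: int


def write_files(values: Sequence[int], rows_per_file: int) -> List[FileStats]:
    """Cut rows into fixed-size 'files' in the given order and record min/max."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append(FileStats(min(chunk), max(chunk)))
    return files


def files_to_scan(files: Sequence[FileStats], lo: int, hi: int) -> int:
    """Count files whose [min, max] range overlaps the predicate lo <= col <= hi."""
    return sum(1 for f in files if f.max_value >= lo and f.min_value <= hi)


# Synthetic column: 100,000 uniformly random integers (assumed data, fixed seed).
rng = random.Random(0)
column = [rng.randrange(1_000_000) for _ in range(100_000)]

# Layout A: rows kept in arrival order. Layout B: rows sorted on the column.
unsorted_files = write_files(column, rows_per_file=1_000)
sorted_files = write_files(sorted(column), rows_per_file=1_000)

# A selective range predicate, as a scan operator's filter might impose.
lo, hi = 500_000, 510_000
print("files in table:        ", len(unsorted_files))
print("scanned, arrival order:", files_to_scan(unsorted_files, lo, hi))
print("scanned, sorted layout:", files_to_scan(sorted_files, lo, hi))
```

In the arrival-order layout nearly every file's min/max range covers the predicate, so almost nothing can be skipped, while the sorted layout confines matching rows to a handful of files. A predicate on a different, uncorrelated column would typically gain nothing from this sort order, which is the workload sensitivity the abstract describes.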

Authors

Venkata Vamsikrishna Meduri

David Kreismann

Ronald Barber

Journals

Proceedings of the ACM on Management of Data

Institutions

IBM Research - Almaden
