What is a Data Lakehouse: A Quick Explanation for Data Folks
A data lakehouse is a data engineering architecture that combines the governance and performance of a data warehouse with the reliability and scalability of a data lake.
This post will briefly explain what makes a data lakehouse so intriguing and how to build one.
A data lakehouse is the best of a data lake combined with the best of a data warehouse.
Data lakes offer effectively unlimited, reliable, and inexpensive storage. They are also able to store any type of data.
However, they can get messy. Without a governance or metadata layer, it can be difficult for users to find the data they need, track down dataset owners, and understand which data is the source of truth.
Data warehouses, such as Snowflake, Redshift, and BigQuery, offer a governance layer and better query performance. You can control access, enforce table schemas, and leverage proprietary storage formats to speed up queries.
However, a data warehouse can be costly to operate and requires anyone accessing your data to go through the warehouse.
The data lakehouse combines the best of these two worlds.
A data lakehouse keeps your data in the data lake. Through the use of a data lake table format, you are able to achieve a level of governance and performance similar to what data warehouses provide.
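To make this concrete, here is a minimal sketch of the idea, assuming Apache Iceberg as the table format (one of the formats discussed below) and Spark as the engine, with the Iceberg Spark runtime and S3 filesystem packages already available. The catalog name, bucket path, and table schema are placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: an Iceberg catalog whose table metadata and data files both
# live in object storage (the data lake). Names and paths are placeholders.
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-data-lake/warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.analytics")

# The table is managed by the table format (schema, partitioning, snapshots),
# but every file it tracks stays in the lake; there is no warehouse load step.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.events (
        event_id   BIGINT,
        event_type STRING,
        event_ts   TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")
```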
Why is a data lakehouse exciting?
There are many reasons, but the two biggest ones, in my opinion, are:
Data lakes become the source of truth for your data. You no longer need to spend time and money to load data to a warehouse before it’s consumable. In other words, faster time to insight.
Use the right tool for the job. You have full access to the open-source ecosystem of data tools. BI and ML workloads are properly supported with a lakehouse because there is no limitation on what tools you can use to operate on your data. You are no longer forced to go through a data warehouse.
The key component of a data lakehouse is a modern data lake table format.
The data lake table format tracks metadata about your data files and represents a snapshot of which files make up your table at a given point in time. This enables a level of governance and improved performance on top of our data lake.
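As an illustration, Iceberg exposes the metadata it tracks as queryable metadata tables (other formats surface similar information in their own ways). A small sketch, continuing the hypothetical table from the earlier example:

```python
# Each snapshot records which data files made up the table at a point in time.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM lake.analytics.events.snapshots
""").show()

# Per-file metadata (row counts, sizes, column bounds) is tracked as well;
# this is what the governance and performance layer is built on.
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes
    FROM lake.analytics.events.files
""").show()
```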
The reason data lakehouses are growing in popularity now is the rise of a new generation of data lake table formats. In the beginning, we had Hive, which worked well for some time, but it only partitions data by folders in cloud storage. You don't get much else.
With Apache Iceberg, Apache Hudi, and Delta Lake, we now have a suite of more advanced features. Additional metadata, such as column-level statistics, is collected, which greatly enhances file pruning at query time. There are also ACID guarantees that allow for safe concurrent reads and writes.
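For example, because every write commits a new atomic snapshot, readers can query the table as of an earlier state while writers keep committing. A hedged sketch, again assuming Iceberg on Spark 3.3+ and the hypothetical table above; the snapshot ID and timestamp are placeholders:

```python
# Read the table as of a specific snapshot (an ID you could take from the
# snapshots metadata table shown earlier; this value is a placeholder).
spark.sql("""
    SELECT COUNT(*) AS events
    FROM lake.analytics.events VERSION AS OF 1234567890123456789
""").show()

# Or as of a point in time. Concurrent writers keep committing new snapshots
# without breaking readers, which is what the ACID guarantees buy you.
spark.sql("""
    SELECT COUNT(*) AS events
    FROM lake.analytics.events TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()
```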
Four Steps to Build a Data Lakehouse
1. Decide if a data lakehouse is the right architecture for you. It requires a fair bit of investment to get started. A traditional data warehouse can be sufficient for a long time.
2. Choose your data lake table format. Iceberg and Delta Lake have roughly reached feature and performance parity. Delta Lake is driven by Databricks, while Iceberg is an Apache Software Foundation project. I'm partial to Apache Iceberg since the project merged one of my commits. Apache Hudi is geared more toward streaming use cases.
3. Produce data using the new data lake table format. Update your data-producing jobs to write through the table format and migrate existing datasets.
4. Point any query engine or tool you'd like at your data lake tables (steps 3 and 4 are sketched in the example after this list).
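To ground steps 3 and 4, here is a hedged sketch continuing the hypothetical Iceberg table from earlier: a job appends data through the table format, then the table is queried in place. Spark handles both sides here, but any Iceberg-aware engine (Trino, Flink, DuckDB, and so on) could run the read side against the same files in the lake.

```python
from datetime import datetime

# Step 3 (sketch): produce data through the table format instead of dropping
# raw files into the lake. Column names match the hypothetical table above.
new_events = spark.createDataFrame(
    [
        (1, "signup", datetime(2024, 1, 1, 12, 0)),
        (2, "login", datetime(2024, 1, 1, 12, 5)),
    ],
    schema="event_id BIGINT, event_type STRING, event_ts TIMESTAMP",
)
new_events.writeTo("lake.analytics.events").append()

# Step 4 (sketch): point any Iceberg-aware engine at the same table. Here it is
# Spark again, but no warehouse sits between the tool and the data.
spark.sql("""
    SELECT event_type, COUNT(*) AS events
    FROM lake.analytics.events
    GROUP BY event_type
""").show()
```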
Once you've decided to build a data lakehouse, the bulk of the effort will be in driving adoption and migrating existing datasets to the new data lake table format. After that, you can continue to enable whatever data tools are right for your organization and use cases.