Tiny Thoughts: Cutting Back on Snowflake
Snowflake is a cloud data warehouse for storing and querying your analytics data. If you use Snowflake, you already know that it is expensive.
A lot of work has gone into open-source columnar data warehouses and SQL engines, and they are now attractive alternatives for running some or all workloads. I’ve recently been looking at DuckDB and Trino for various needs.
DuckDB is an in-process SQL database for analytics that I think can really benefit analysts and data scientists doing ad-hoc analyses. We might even be able to use it to offload some dbt tasks. Here’s a quick rundown on why folks are excited about DuckDB.
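As a rough sketch of what an ad-hoc analysis could look like, the snippet below builds an aggregate query over Parquet files and runs it in-process with DuckDB's Python package. The file path, the event_type column, and the helper names are made up for illustration; only duckdb.connect, execute, fetchall, and the read_parquet table function come from DuckDB itself.

```python
def top_events_sql(parquet_glob: str, limit: int = 10) -> str:
    """Build an aggregate query that reads Parquet files directly.

    `event_type` is a hypothetical column name; swap in your own schema.
    The path is interpolated into the SQL text, so only pass trusted values.
    """
    return (
        "SELECT event_type, count(*) AS n "
        f"FROM read_parquet('{parquet_glob}') "
        "GROUP BY event_type "
        "ORDER BY n DESC "
        f"LIMIT {int(limit)}"
    )


def run_locally(sql: str):
    """Run a query in an in-memory DuckDB database and return all rows."""
    # Third-party package (pip install duckdb), imported lazily so the
    # query builder above can be used without it.
    import duckdb

    con = duckdb.connect()  # no path given: in-memory database
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    # Hypothetical location of exported Parquet files.
    print(top_events_sql("data/events/*.parquet", limit=5))
```

Nothing here needs a server or a warehouse: the query runs inside the Python process, on the analyst's own machine.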
Trino is a little different: it’s a distributed SQL query engine for analytics. Trino is not a database itself; instead, it connects to many different data sources. It can connect to traditional relational databases like Postgres or query data in cloud storage directly. What’s nice is that you use the same SQL dialect to query every source. You also own the infrastructure it runs on, which again saves on Snowflake compute costs.
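To make the "one SQL dialect, many sources" idea concrete, here is a minimal sketch using the trino Python client. It joins a table in a Postgres catalog with a table in a data-lake catalog in a single query. The catalog, schema, table, and column names, as well as the host and user, are all assumptions for illustration and depend entirely on how your Trino catalogs are configured.

```python
def federated_revenue_sql(pg_catalog: str = "postgres",
                          lake_catalog: str = "hive") -> str:
    """Build one query that joins across two Trino catalogs.

    Tables are addressed as catalog.schema.table. All names below
    are hypothetical.
    """
    return (
        "SELECT c.region, sum(o.amount) AS revenue "
        f"FROM {lake_catalog}.analytics.orders AS o "
        f"JOIN {pg_catalog}.public.customers AS c "
        "ON o.customer_id = c.id "
        "GROUP BY c.region"
    )


def run_on_trino(sql: str,
                 host: str = "trino.example.internal",  # hypothetical host
                 port: int = 8080,
                 user: str = "analyst"):
    """Send a query to a Trino coordinator and return all rows."""
    # Third-party package (pip install trino), imported lazily.
    from trino.dbapi import connect

    conn = connect(host=host, port=port, user=user)
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


if __name__ == "__main__":
    print(federated_revenue_sql())
```

The query text stays the same whichever systems sit behind the two catalogs; only the catalog names change.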
We have a lot of Parquet files in cloud storage. Using a table format like Iceberg on top of that storage layer and then adding Iceberg as a data source in Trino seems like the path forward for our big data needs. Snowflake also supports Iceberg tables.
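Adding Iceberg as a Trino data source mostly comes down to a small catalog properties file. The sketch below renders one, assuming the Iceberg tables are tracked in a Hive metastore; other catalog types such as Glue or REST need different properties. The metastore URI and the function name are placeholders of mine.

```python
def iceberg_catalog_properties(metastore_uri: str) -> str:
    """Render a Trino catalog file for the Iceberg connector.

    Assumes a Hive metastore tracks the Iceberg tables.
    """
    props = {
        "connector.name": "iceberg",
        "iceberg.catalog.type": "hive_metastore",
        "hive.metastore.uri": metastore_uri,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    # Hypothetical metastore address.
    print(iceberg_catalog_properties("thrift://metastore.example.internal:9083"))
```

Saved as etc/catalog/iceberg.properties on the Trino nodes, this would expose the tables under a catalog named iceberg, since Trino takes the catalog name from the file name.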
I’m still investigating new solutions for my teams, but it’s an exciting time to be in the big data space with all these open-source tools being developed.