We switched from a postgres data warehouse earlier this year to a Snowflake data warehouse. As part of that transition we spoke with sales and learned about what drives your bill on snowflake. It turned out that 15-20% of our bill with snowflake was because of our BI tool, Metabase.
That led me to wonder if I could understand better what was driving our costs on snowflake and ensure that we didn’t get an unexpected bill. Our storage costs are tiny (we’re lucky at Reforge that we have enough data to make it useful, but not so much that the costs are prohibitive / make us move slowly).
I did two things as a result of this:
- Visualized the costs over time.
- We setup Metaplane so that it would monitor our daily usage and alert if it deviated from the historical norm. This was really easy.
Visualizing Costs over Time
I got some great feedback from the DBT community about the types of queries I should be running to run down costs. They asked good questions like:
- How are your warehouses configured?
- What warehouses are creating the most spend?
- Are the warehouses set to auto-suspend?
- What services are contributing to your bill?
- Is your spend mostly on storage or compute?
- What queries cost the most?
Ultimately, one of the things that came out of the analysis was that all of our bill was being used for compute. The sales team at snowflake said that most customers don’t pay for “cloud services” and that we were an outlier. That led us to look into what was driving the overage on our cloud services. I visualized the cost over time with Deepnote for both compute and cloud services. You can see that “overage-cloud services” represent a big part of our bill, and we attributed most of that cost to the metabase queries.
We discovered the issue in April and took some action to try to reduce it. If you want to do the same analysis, I posted the notebook here. I’d love to build a flowchart of the types of things you should be looking at when analyzing your snowflake bill and adding them to the jupyter notebook.
Being Alerted to When Costs Go Haywire
Metaplane is a tool that looks at your data systems and lets you know when something is broken. It could be that tables have stopped being updated, it’s not being updated at the same right, or someone is spamming events to your system. It has many different kinds of tests, but ultimately it helps to ensure that you know when things are going haywire so you can fix it before it causes problems or causes others to lose confidence in the reliability of your data and systems.
We configured Metaplane to run this query every day for both cloud services and compute spend, which are the vast majority of our spend. If it deviates from the historical norm it alerts the team so we can dig into what is increasing our usage and our bill. A big thanks to Ian Whitestone in the DBT community who helped with suggestions on this query as well as other ideas.
SUM(usage_in_currency) as total_cost
where usage_type = 'overage-cloud services'
and usage_date >= current_date - interval '1 day'
and usage_date < current_date
order by 1
In the above image you can see how there was a big spike in usage that was much higher than the previous sixty days. Our team was notified, looked into the issue, and this was the result:
found two queries made in the Snowflake UI that had been running for 20 and 40 hours respectively. I force canceled the queries and let the user know what happened. We also decided to update the query timeout account wide from it’s default value of 48 hours down to 1 hour.One of the data engineers on the team
This is a great example of how Metaplane ensures that we’re on top of anything that is unexpected and can affect our team. It may be data quality issues, reporting issues, delays, logic issues, or something like costs. Examples like this confirm that data observability is a must-have that all companies will have in the next couple of years.
Ultimately, we didn’t think it was worth our team’s time to try to optimize our spend on Snowflake due to our metabase issues. I’d rather the team spend time on creating new value for our company and our customers rather than spending a lot of time to save $1,000 / month. There will come a point where it’s worth it for someone on our team to look at this and a myriad of other areas for improvements, but until then we’ll be monitoring our spend closely to ensure it doesn’t change materially from where it is today. If anyone has any tips or insight into how to address the metabase problems, let me know!