A new tool for Snowflake optimization (part 2)
January 25, 2023
Continuing the discussion…
In part I of our series, we introduced a new Snowflake table function in public preview – GET_QUERY_OPERATOR_STATS. We demonstrated how you can use this new feature to programmatically explore queries run in the last fourteen days and ensure they’re not resulting in “exploding joins”, or returning result sets with far more records than the combined records in the tables queried. This is a very common query problem which always ends up costing a great deal of money. However, it’s often difficult to track down; especially when queries are complex. If you haven’t read Part I, we highly recommend you do so first to get some of the context and additional details.
In this post, we’ll focus on another common query problem which can be uncovered with the get_query_operator_stats table function – unions with, and without the ‘ALL’ qualifier. As with exploding joins, this is an issue which was previously only identifiable via the Snowflake Query Profile. Also as with exploding joins, it’s an issue difficult to identify even with the Query Profile when queries are sufficiently complex; and when are they not?
“Union with, and without all”
The second optimization opportunity which is difficult to ascertain from the SNOWFLAKE.ACCOUNT_USAGE views, and not terribly easy to catch in the graphical query profile is queries which use UNION when UNION ALL is sufficient. The UNION set operator concatenates two result sets and removes any duplicates. If you add in the optional ALL, Snowflake will skip the deduplication step, which is fine in the majority of cases.
To illustrate this point, we can union the shared snowflake_sample_data.tpch_sf1.customer table to itself with this query:
select c1.c_name from snowflake_sample_data.tpch_sf1.customer c1 where c1.c_acctbal > 3000 union select c1.c_name from snowflake_sample_data.tpch_sf1.customer c1 where c1.c_acctbal < 3000 ;
Analyzing this query with the get_query_operator_stats function clearly shows we didn’t need the deduplication step. As mentioned in our first post, the get_query_operator_stats table function returns all the information presented in the query profile. However, we’ve parsed out some of the variant fields and limited the results to highlight records input and output.
The deduplication step of the UNION operator is identified by the “Aggregate” operator type (row 2) after the “UnionAll” step (row 3) below. Notice how we output the same number of rows we input. In other words, there were no duplicates to be concerned with.
If we add the optional ALL to our union, the results of the get_query_operator_stats are:
Although the result sets are the same, the first query took 1.8 seconds to complete while the second only took 415 milliseconds, or roughly less than a quarter of the time of the first query. Imagine shaving 75% off your compute costs!
The get_query_operator_stats table function is a fantastic new feature in a long line of Snowflake features introduced to help you control your costs. In these last two posts, we demonstrated how to use this new feature to quickly identify problem queries so you can optimize them and dramatically reduce your Snowflake spend.
Contact us to talk more about how DAS42 can help you monitor and control your Snowflake costs. We are a Snowflake Elite Services Partner, winner of the 2022 Americas Innovation Partner of the Year award, and have helped all of our clients ensure they’re getting maximum ROI from their Snowflake implementations.