A Practical Guide to Baselining and Optimization
As a Snowflake Services Elite Partner, DAS42 frequently works with clients at many different stages of their respective journeys with Snowflake. Business logic, delivery SLAs (Service Level Agreements), and reporting use cases differ widely across these clients, but everyone from early adopters to recent converts shares two primary concerns: performance and cost. While clustered tables and optimized queries can have a notable impact on performance and cost in Snowflake’s platform, I believe the most immediately accessible adjustment in any use case is warehouse sizing.
What are warehouses?
Warehouses are the computational component through which all queries and tasks are run in Snowflake’s data platform. They are most easily thought of as cloud compute clusters, similar to a VM (Virtual Machine) in the sense that they are an on-demand compute resource entirely separate from any data storage system. One of Snowflake’s most compelling selling points and largest advancements as an OLAP (Online Analytical Processing) offering is the way in which it separates storage and compute infrastructure and the costs associated with them. Practically, this means the power and persistence of the compute resources used for analytics query processing are in no way bound to the data loading characteristics or overall storage volume of a given instance. The administrators of a Snowflake instance have the power to define the level of compute infrastructure available in the instance by provisioning warehouses. Warehouses can be created in numerous pre-configured sizes, ranging from X-Small to 4X-Large (up to 6X-Large in certain regions), each with twice the computational power of the next smaller size.
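As a point of reference, provisioning a warehouse is a single DDL statement. The minimal sketch below creates an extra-small warehouse that suspends itself after a minute of inactivity; the warehouse name is illustrative, and the auto-suspend and auto-resume settings are discussed further below.

```sql
-- Minimal sketch: provision an XS warehouse that pauses itself when idle.
-- The name AD_HOC_ANALYSIS is illustrative.
CREATE WAREHOUSE IF NOT EXISTS ad_hoc_analysis
  WAREHOUSE_SIZE = 'XSMALL'     -- start small; resize later based on baselining
  AUTO_SUSPEND = 60             -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE            -- wake automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;   -- do not start billing until first use
```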
How does warehouse size affect cost and performance?
Warehouses, regardless of size, incur usage costs based on the amount of time they are running. “Running” simply means the warehouse is either actively processing queries or is idle but ready to process one the moment it is issued; a warehouse can therefore accrue costs even when no query is executing. As previously mentioned, each increase in warehouse size corresponds with a doubling of the compute resources available to the warehouse. The same is true for the hourly billing rate, measured in Snowflake Credits, applied to the warehouse while it is running.
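To see what this billing model looks like in practice, you can review per-warehouse credit consumption in Snowflake’s ACCOUNT_USAGE share. The sketch below sums daily credits for the last 30 days; the lookback window is arbitrary.

```sql
-- Daily credit consumption per warehouse over the last 30 days (window is arbitrary).
SELECT
  warehouse_name,
  DATE_TRUNC('day', start_time) AS usage_date,
  SUM(credits_used)             AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2;
```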
The effect of warehouse sizing on single-query performance scales similarly in many cases. For complex queries that make use of large tables and/or multiple joins, increasing the warehouse by one size should cut query processing time roughly in half when tables are efficiently clustered and returned values are evenly distributed. Query processing operations like TableScans, ExternalScans, and Joins are the most memory-intensive stages of a query, and therefore benefit from the additional working memory that comes with increases in warehouse size. For this reason, increasing the size of a warehouse, while also enabling warehouse auto-suspension, can be a great way to reduce query processing time without notably increasing compute costs in Snowflake. Keep in mind that highly skewed values in a query’s result set can reduce the performance gain from a warehouse size increase. Furthermore, performance gains become less substantial once the designated warehouse size can complete the Scan stages of the most complex queries as quickly as the Processing stages, such as filtering, grouping, and aggregation.
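Resizing a warehouse and tightening its auto-suspend window are both single ALTER statements. The sketch below bumps a hypothetical warehouse up one size while keeping a short suspension timeout so the larger (and more expensive per hour) warehouse does not sit idle.

```sql
-- Resize up one notch, but suspend aggressively so the larger warehouse
-- only bills while work is actually running. Warehouse name is illustrative.
ALTER WAREHOUSE ad_hoc_analysis SET
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60;
```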
Though auto-scaling features exist to increase warehouse resources both vertically and horizontally, identifying the correct default warehouse size for your use case is crucial to minimizing costs and maximizing performance. Doing so ensures that SLAs can be met with the least amount of incurred compute cost by balancing execution time against billable warehouse uptime. Furthermore, manually identifying the ideal warehouse size removes the need to rely solely on event-driven auto-scaling systems, which can be inefficient due to lagging event triggers and a higher compute cost per unit of work.
How to identify the optimal warehouse size
Choosing the correct warehouse size for your workload within Snowflake can make the difference between cumulative hours spent waiting on queries that ultimately fail and complex queries that return results in seconds. For this reason, it is important to systematically run baseline tests to help home in on the optimal warehouse size. Follow these steps any time you are provisioning a new warehouse in Snowflake, or are considering adjusting the size of an existing warehouse.
Prerequisites:
- Permissions to create and alter warehouse resources in your Snowflake instance
- All tables that will be queried through the warehouse in question are already optimally clustered, where clustering is warranted (see the quick check below)
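If you need a quick sanity check on the second prerequisite, Snowflake exposes clustering metadata directly. The call below (table and clustering column are hypothetical) reports average clustering depth and partition overlap for the columns you expect your heaviest queries to filter or join on.

```sql
-- Hypothetical table and clustering key; returns JSON with average depth and overlap stats.
SELECT SYSTEM$CLUSTERING_INFORMATION('analytics.sales.fact_orders', '(order_date)');
```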
Step 1: Describe the warehouse’s use case and define SLAs
Although this step may seem simple, starting with a description of the user group and functional purpose for a given warehouse will help you instantly exclude certain approaches to warehouse sizing. Will the warehouse serve simple, daily reporting tasks? Will a team of analysts use the warehouse for sporadic, complex, ad hoc analysis requests? Will a business intelligence tool with a fluctuating volume of queries rely on the warehouse? Document these details, and use them to define quantifiable SLAs for query performance. These SLAs, along with the description of typical query patterns, will frame the considerations made in the following steps.
SLA Example:
“With an SLA target of 99%, the latency for query processing in the AD_HOC_ANALYSIS warehouse shouldn’t exceed 30 seconds.”
Step 2: Identify the most and least complex queries that will run on the warehouse
This second step is crucial to selecting the ideal warehouse size because it defines the extremes of what the warehouse will be tasked with computationally. To identify these queries, first consider the data objects that are accessible to the user group identified in Step 1. Next, consider what the simplest analytically meaningful query issued against those data objects might be. Do the same for the most complex potential query, aiming to make use of the largest tables and the greatest number of meaningful joins. Asking the user group to guide this process can be helpful if you do not have enough context on the use case or the data resources, or lack permissions to the needed data objects. Use the EXPLAIN command in Snowflake to validate the complexity (bytes and partitions scanned) of a given query without having to actually execute it.
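For reference, EXPLAIN compiles the query plan without executing it, and the plan’s global statistics include the total partitions and bytes the query would be assigned. A sketch against hypothetical tables:

```sql
-- Compile only (no execution, no warehouse time); inspect partitionsAssigned
-- and bytesAssigned in the output to gauge complexity. Tables are hypothetical.
EXPLAIN
SELECT c.region, SUM(o.amount) AS total_amount
FROM analytics.sales.fact_orders o
JOIN analytics.sales.dim_customers c ON o.customer_id = c.customer_id
GROUP BY c.region;
```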
Step 3: Test for the minimum viable warehouse size
Armed with the query performance SLAs defined in Step 1, use the most complex query example(s) from Step 2 to benchmark performance on different warehouse sizes. Start with an XS warehouse and run the query. If the query doesn’t complete within the specified SLA, increase the warehouse size and retest. Conversely, if the query completes in significantly less time than required by your SLA, reduce the warehouse size and retest. Repeat until you reach the minimum warehouse size that can consistently complete the most complex prospective queries within the required time.
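One practical detail when benchmarking: disable the result cache for your session so a repeated test query is actually re-executed, then resize and rerun. A rough sketch of the loop, with an illustrative warehouse name and sizes:

```sql
-- Keep the result cache from short-circuiting repeated benchmark runs.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;

-- Iteration 1: test on XS.
ALTER WAREHOUSE ad_hoc_analysis SET WAREHOUSE_SIZE = 'XSMALL';
-- ...run the most complex query from Step 2 and record its execution time...

-- Iteration 2: too slow for the SLA? Move up one size and retest.
ALTER WAREHOUSE ad_hoc_analysis SET WAREHOUSE_SIZE = 'SMALL';
-- ...rerun and compare against the SLA defined in Step 1...
```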
Step 4: Consider separating workloads across multiple warehouses based on query complexity
In order to reduce compute costs as much as possible, it may be worth relegating query workloads of similar complexity to separate, differently sized warehouses. This approach can be extremely effective when the distribution of query complexity across a warehouse’s workload is heavily skewed toward either extreme. If, for example, 90% of the queries on a given warehouse scan a total of 20 partitions and 400MB, but the remaining 10% average 400 partitions and 8GB, it could be advantageous to split these two workloads across separate warehouses. Doing so would allow the less complex queries to run effectively on a smaller warehouse, incurring less compute cost per hour while still completing within the required SLA. Meanwhile, the more complex queries could take advantage of a larger warehouse, which could then be suspended far more often because the steady stream of simple queries that previously kept it running, and incurring costs, would be gone.
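To see whether your workload actually has this kind of skew, the ACCOUNT_USAGE query history can summarize scan volume per query. The sketch below compares median and 90th-percentile partitions and bytes scanned for a hypothetical warehouse over the last 30 days; a large gap between the two suggests a workload worth splitting.

```sql
-- Median vs. 90th-percentile scan volume; a wide gap hints at a split-worthy workload.
-- Warehouse name and lookback window are illustrative.
SELECT
  APPROX_PERCENTILE(partitions_scanned, 0.5) AS p50_partitions,
  APPROX_PERCENTILE(partitions_scanned, 0.9) AS p90_partitions,
  APPROX_PERCENTILE(bytes_scanned, 0.5)      AS p50_bytes,
  APPROX_PERCENTILE(bytes_scanned, 0.9)      AS p90_bytes
FROM snowflake.account_usage.query_history
WHERE warehouse_name = 'AD_HOC_ANALYSIS'
  AND start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP());
```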
Step 5: Monitor warehouse performance
Administrators or SysOps engineers should repeat Steps 2 and 3, at minimum, on a quarterly basis. As reporting needs and data pipelines change, so does the complexity of queries issued to your warehouse and the resulting performance of those queries. For this reason, create automated reports either directly in Snowflake through the use of scheduled Tasks or, preferably, in a BI tool of your choosing. Use the Query History view and other administrative metadata tables to track query performance across warehouses, focusing on execution time and time spent in queue. Reports should lend visibility into outlier queries that represent new extremes for query complexity, and should monitor SLA adherence. By making these health checks an automated part of your organization’s scheduled maintenance windows and/or deployment cycles, you can ensure that warehouse performance and costs are always being considered as data and query volumes change.
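A monitoring report along these lines can be driven by a single query against the query history. The sketch below tracks weekly average execution and queue time per warehouse, plus the share of queries exceeding a 30-second threshold (mirroring the earlier example SLA; adjust to your own). Note that the ACCOUNT_USAGE views are not real-time.

```sql
-- Weekly performance and SLA adherence per warehouse.
-- Times are in milliseconds; the 30000 ms threshold mirrors the example SLA.
SELECT
  warehouse_name,
  DATE_TRUNC('week', start_time)                        AS week,
  AVG(execution_time) / 1000                            AS avg_execution_seconds,
  AVG(queued_overload_time) / 1000                      AS avg_queued_seconds,
  COUNT_IF(total_elapsed_time > 30000) / COUNT(*) * 100 AS pct_queries_over_sla
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('month', -3, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2;
```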
Additional considerations
The following features are only available on Enterprise Edition accounts and higher.
Reduce query queuing with multi-cluster warehouses
Though changes in warehouse size (vertical scaling) stand to drastically affect single-query performance, horizontal scaling through Snowflake’s multi-cluster warehouses can be a powerful tool for handling query volume issues. In cases where the size of a warehouse is appropriate for the complexity of the queries issued to it, but query queuing is delaying execution, consider enabling multi-cluster warehouses. Doing so allows more queries to be processed concurrently without increasing the size of the individual clusters they are processed on.
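Enabling multi-cluster scaling is another warehouse-level setting. In the sketch below, a hypothetical reporting warehouse can spin up to three clusters of the same size during bursts of concurrent queries, then shrink back to one; the name and cluster counts are illustrative.

```sql
-- Scale out (not up): add clusters of the same size under concurrency pressure.
-- Warehouse name and cluster counts are illustrative; requires Enterprise Edition or higher.
ALTER WAREHOUSE bi_reporting SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY    = 'STANDARD';  -- favor starting clusters to minimize queuing
```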
Dynamic vertical scaling with the query acceleration service
Similar to how multi-cluster warehouses allow for dynamic horizontal scaling, the Query Acceleration Service in Snowflake can be a low-maintenance way to handle outlier queries that would typically process slowly on an otherwise optimally sized warehouse. This service automatically diverts certain stages of a complex query to compute resources outside of the warehouse the query was issued on. The result is less warehouse uptime and improved query execution time. This approach is recommended in use cases where query and data volume are highly variable (complex ad hoc analysis, sporadic data loading, etc.), since it requires essentially no maintenance. However, when complex queries are consistently present and predictable, separating them out onto a designated larger warehouse with its own auto-suspension criteria is recommended.
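The service is enabled per warehouse. In the sketch below, the scale factor caps how much additional serverless compute the service can lease relative to the warehouse’s own size; both the warehouse name and the factor of 8 are illustrative.

```sql
-- Let outlier scans borrow serverless compute, capped at 8x the warehouse's resources.
-- Warehouse name and scale factor are illustrative; requires Enterprise Edition or higher.
ALTER WAREHOUSE ad_hoc_analysis SET
  ENABLE_QUERY_ACCELERATION = TRUE
  QUERY_ACCELERATION_MAX_SCALE_FACTOR = 8;
```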
Key takeaways
In summary, taking the time to manually set baselines for warehouse size in your Snowflake instance will ensure that you are optimizing performance, controlling compute costs, and maximizing the positive impact that auto-scaling features have while minimizing reliance on them. To successfully baseline warehouse size, start by thoroughly defining your warehouses’ use cases and corresponding SLAs, then follow the above testing steps to identify the optimal size for each warehouse. Monitor and retest often as part of your organization’s deployment or review cycles to ensure that warehouses are always running as efficiently as possible, and reap the benefits of warehouses tuned specifically to your end users’ unique needs.
Looking for specific guidance on your use case? Curious to hear more about how DAS42 can help you make the most of your Snowflake instance? Reach out at hello@das42.com and we’d be happy to help!