Introduction to Snowflake Data Warehouse
Snowflake is a cloud-based Analytical data warehouse provided as Software-as-a-service. Snowflake is ready to use a solution that the user requires to just use it directly without worrying about its installation and deployment and then its startup. Snowflake doesn’t allow any user to use Snowflake on any private infrastructure as it uses public cloud infrastructure. Snowflake has an instance for the management of computations, and it persists data through storage service. As Snowflake doesn’t require any installation or deployment, users need not worry about its updations and maintenance, and Snowflake automatically handles this for its all types of users. Snowflake users may include end-users, business analysts, or data scientists as well.
Overview of Snowflake Architecture
Snowflake is a pay as you go service. It charged users on the per-second basis of computation or query processing. There exist some data warehouses that use either shared-disk or shared-nothing architecture, but Snowflake uses hybrid architecture comprises of both shared-disk or shared-nothing architecture. Traditional shared-disk architecture has multiple cluster nodes. In contrast, just one storage node accessible to these all cluster nodes and traditional shared-nothing architecture contains nothing in common, i.e., different storage nodes for each cluster node (CPU). Snowflake’s hybrid architecture has three layers –
- Cloud Services Layer
- Query Processing Layer
- Database Storage Layer
Database Storage Layer
- Snowflake has Scalable cloud blob storage type for storing structured and semi-structured data (including JSON, AVRO, and Parquet).
- The storage layer contains tables, schemas, databases, and diverse data. Tables can store as much as of petabytes of data and can have partitions effectively managed.
- Snowflake provides internal optimization and compression of data for easy metadata extraction and query processing.
- Data objects in Snowflake are only accessible via SQL queries through the Compute layer and are hidden to users.
- Storage is made up of multiple micro partitions that scale automatically when required.
- All Micro partitions are encrypted, and metadata is generated to provide more efficiency and performance in data retrieval.
Query processing Layer
- This layer is the compute layer of architecture that contains multiple virtual warehouses, and every query runs on one virtual warehouse.
- Virtual warehouses are the compute engines of Snowflake that empower the scalable compute capacity.
- The compute layer retrieves data from the storage layer, and cached to compute resources locally to improve future query responses, i.e., each Virtual warehouse had its cache.
- Multiple virtual warehouses can operate simultaneously and also maintains ACID and follow Multiple parallel processing to perform on data.
- Read and Write query integrity is maintained on the cloud service layer that accesses required virtual warehouse and computes nodes.
- Virtual Warehouses can be auto-resume, and auto suspend, easily scalable, has auto-scaling factor inbuilt (when defined).
Cloud Services Layer
- The service layer is the “Brain” of the Snowflake. It maintains, optimizes, transacts on data, provides security on data, metadata management, and data sharing.
- Cloud service is Stateless compute resource that runs on multiple availability zones and utilizes highly available and useful metadata.
- Service layer enables SQL client interface for DDL and DML like operations on data.
- Like other layers, this layer can also be independently scalable.
Why Snowflake Architecture?
3.1 Warehouse Usages
There are different sizes of warehouses available in Snowflake from X-Small to 4X-Large. Each volume of the warehouse contains a different number of cluster nodes and had different billing. The bigger the size of the warehouses, the more the workers. But one must be aware of using each type of warehouse as a query must be capable of loading by a particular warehouse and had enough clusters or servers provided.
Increasing the size of the data warehouse is not only the factor of performance enhancement in the query. Data must be stored in a fully optimized and normalized form of partitions so that fewer files need to be processed for loading results.
Snowflake can auto-suspend and auto-resume the warehouse upon provided. It means the warehouse can be auto stopped when it is inactive for some time. It wakes up automatically when any query is submitted to that warehouse.
One can simultaneously run multiple queries, and Snowflake has a concurrent mechanism to manage and provide integrity among multiple running queries. For this, the warehouse of Snowflake used multi-cluster architecture to reserve and calculated required to compute resources.
Snowflake warehouses manage the cache to provide efficiency in queries. One must be aware if there is a need to store cache or not. If there is a need to cache (i.e., data is large and the query processing time is more), then one must re-think to enable/disable the auto-suspend and resume feature as every time the warehouse restarts, it runs more queries and also more processing will be required.
3.2 Roles to Access Data
Control on data access in Snowflake is delivered by defining roles on data. Database Admin has the privilege to create as many characters as needed with granted access to particular data objects such as tables, schemas, and databases. DBA can provide Read, write access also to need specific roles. Snowflake has an advantage that one role can be assigned to another role as well to maintain a hierarchy of privileges on data control access.
3.3 Partition Blocks and Updations
Data stored in Snowflake is in the form of micro-partitions. The size of these micro-partitions may vary between 50-500 MB. Snowflake automates the process of resizing the partition, and the user also has the privilege to apply such operations to resize these micro-partition blocks. The effectiveness of partitions can be a measure while retrieving data.
When partitions are appropriately managed and contain processed and informational cluster partitions, the query performance will be auto increases, and the query takes less time to generate results from stored data. When no cluster information is saved with tables and queries contains the data that needs to be accessed by columns, then Snowflake downgrades the performance. It may cost more and also processes irrelevant data.
Suppose a table contains TB of data, and then we apply to recluster to that table. In this case, data is rearranged to the newly defined format as it needs to extract the required cluster information.
3.4 Time Travel
Snowflake provides a feature to time travel on data in the past for a specified interval of time. By default, every new table in Snowflake has one day of data time travel. This provides more value when there is a need to restore the data from a specific point on time in the past and to map data effectively to measure data changes and volume of data that needs to operate or process again. Another benefit to enterprises is that they can evaluate how data is being updated from time to time. It helps to restore the tables, schema, and database upon provided the retention period to time travel (from 1 day up to 90 days).
Snowflake provides a feature to manage workarounds on data. Users can create as many worksheets as they require with the commenting feature enabled. It means one can perform statistics on data and create a worksheet of those statistics. This helps to work on the go without managing local reports and saving them. We can keep worksheets for as long as we need them, and we can update them at any time.
Analysis of Queries in Snowflake
The performance of queries in Snowflake is available as Query Profile in Snowflake Web Interface. When a user performs any query, and select that query, Snowflake will display graphical and statistical information of that query. The snowflake interface provides an option to view the history of queries run in the past. Selecting a query from history and moving to its profile tab will display the query profile. It contains a DAG of the query plan. By reviewing this plan of query in Snowflake, it becomes easier to eliminate unused parts and generate productive queries that process fewer data and generates required data only.
Queries containing joins and aggregation are costly as Snowflake is not meant to perform these operations on the go. In contrast, in some cases, Snowflake keeps track of some basic aggregation operations on the go while saving data internally.
Snowflake Best Practices
5.1 Best practices for Schema Design
- Most of the time, tables contain a field for timestamp or date data type for required granularity. Snowflake provides an efficient result on timestamp and date type rather than VARCHAR type.
- Using the clustering key is a good practice. If your table contains a small amount of data and aggregate data is required all the time, then these clustering may result in the generation of many small cluster files.
- Always remember the column data length in case of the Snowflake as its storage may generate erroneous data when outputted.
- Snowflake stores data of timestamp and date types in string formats when data type used in VARIANT, so in this case, it is recommended to flatten the structure and use appropriate type. Retrieving a string is slower than retrieving a timestamp.
5.2 Best practices for Query Optimization
- Clustering helps to store data in the form of cluster storage files with named partitions. Tables having effective and active clusters may have bounded queries to these clusters, such as using where conditions on these clusters.
- Using warehouse suspension and auto-resume also affect the query. By default, the warehouse suspension time is 10 minutes. The segmented workload must be handled by a typical virtual warehouse and by multiple users (if required). This will help to re-use the result cache generated by Snowflake and hence provides instant results on the same data.
- If concurrent data is needed, Snowflake can auto-scale the warehouse, and it can be updated from the user interface. One should use this feature as well as an optimized query together to obtain optimization.
5.3 Best practices for Efficient Reads
Snowflake reader is quite slow and expensive if we are accessing data with aggregations directly from Snowflake, so use “query” instead of “dbtable” in Spark-Snowflake connector, if data has some limits and conditions but no aggregation.
5.4 Best practices for Efficient Writes
If you are writing data to snowflake table in Spark, and that table already exists, try to select data in the same sequence/order of columns as they are in Snowflake. The workaround to resolve this is to use the “column_mapping” option as “name” while writing. “column_mapping” -> “name” approach is CASE SENSITIVE.
Snowflake Pricing and Best Practices for Optimizing Cost
Snowflake charged the users on average monthly storage per TB and paid per second billing. This depends on the other three factors as well, such as –
- In a multi-cluster environment, the number of clusters used
- Count of servers used in each clustering
- Running time of each server in each cluster
Optimizing and segmentation of workloads based on these three factors help to reduce cost such as using a specific size of the warehouse for a particular workload and by particular users with some caching retention time (or warehouse suspension time) help to reuse the cache and hence retrieves result faster.
Less complex queries help in low billing, and these complexities in queries can be removed by applying certain cluster access conditions and by removing massive joins. Joins causes more shuffle operations as compared to direct loading. It is always recommended to load the data through Spark, looker, etc. and perform activities such as joins and aggregation on data in them.
The cost can be reduced by identifying long-running queries from Query profile and then simplifying the schema. Also, we can define roles for the users to prevent irrelevant queries on data and unauthorized access.
Snowflake allows to create of transient tables and also doesn’t preserves their history. So, the average monthly storage cost can be reduced on the go.
A Holistic Approach
With cloud data warehouse offerings Enterprises can enable elasticity, reduce I/O for better performance per query and increase faster performance, to learn more about how data warehouse solutions are transforming industries we recommend taking the following steps –