Redshift distribution keys

9/10/2023

This is a quick post following on with the AWS theme from the last two posts: AWS Overview and EC2. I wanted to delve into Redshift, which isn’t something I’ve covered on the Udemy course I’m doing but is a technology I am particularly interested in from a “big data in the cloud” perspective.

Redshift is a scalable, cloud-based data warehousing technology that is marketed as being very cheap. Its architecture is very similar to that of Hadoop in that it has a Leader Node and one or more Compute Nodes. Each of the Compute Nodes is split into slices. Each slice gets allocated CPU and a portion of the table data so that the slices can work on the data in parallel.

One obvious benefit is the scalability of both storage and processing. Because nodes are split into slices that execute in parallel, you can scale your compute power horizontally by adding nodes.

Redshift uses columnar storage for faster computations. If you are frequently performing column-based calculations such as sums and averages, then columnar storage will drastically reduce the amount of I/O because all the data items within a column are stored sequentially on disk next to each other. It also means that each column can be encoded and compressed in isolation, so you use less storage and therefore less I/O.

Redshift also keeps metadata about each block (zone maps). This contains things such as the MIN and MAX values per block, and it is used to prune blocks that don’t contain data for a given query. This again minimizes unnecessary I/O.

A sort key means rows with similar keys are stored next to each other. If you access data through WHERE clauses with date ranges, then using a date as part of your sort key will mean that only the blocks that contain those dates need to be accessed. All others can be skipped, heavily reducing I/O. This concept is similar to partitioning a Hive table in Hadoop.

Distribution keys determine where the data is stored. Fundamentally, the cluster spreads data across its compute nodes, since query performance can suffer greatly if a large amount of data sits on a single node. This concept is very similar to the Row Key in HBase. Rows with the same distribution key value are stored together, so picking a distribution key that contains a lot of distinct values means that your data will be evenly distributed across the nodes. This is important for performance: if too many rows share the same few key values, a lot of data will be stored on one node (hotspotting). Parallel processing then becomes difficult, as one node ends up doing all the work.

You pay as you go: the cost is the number of nodes multiplied by the price per node per hour.

Similar to Hadoop, data is replicated across nodes, so if a node or disk goes down there are always other copies that can take over. Also, because Redshift is within the AWS ecosystem, it can make use of Amazon’s Hot/Warm/Cold storage options available in S3, meaning you can back up to S3 across regions (for increased safety). So even if a Region goes down, if you back up across regions you will be safe too.

On the security side, Amazon VPC provides network isolation and IAM provides user access controls. Data can be encrypted on disk, and data can be loaded encrypted from S3 too.
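To make the hotspotting point about distribution keys concrete, here is a minimal Python sketch. It is a simulation and not Redshift code: Redshift's real hashing is internal, so `hashlib.md5` is only a stand-in, and `slice_for`, `rows_per_slice`, `skew_ratio` and the four-slice cluster are all hypothetical names and numbers I have made up for illustration.

```python
import hashlib
from collections import Counter

NUM_SLICES = 4  # assumed cluster size, purely illustrative


def slice_for(key, num_slices):
    """Map a distribution key value to a slice using a stable hash (md5 is a stand-in)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def rows_per_slice(keys, num_slices):
    """Count how many rows land on each slice for a given list of key values."""
    counts = Counter(slice_for(k, num_slices) for k in keys)
    return [counts.get(s, 0) for s in range(num_slices)]


def skew_ratio(per_slice):
    """Largest slice divided by the mean slice size; 1.0 means perfectly even."""
    mean = sum(per_slice) / len(per_slice)
    return max(per_slice) / mean


# High-cardinality key: a different value on every row (e.g. a customer id).
high_cardinality = list(range(10_000))
# Low-cardinality key: only two distinct values (e.g. a country code).
low_cardinality = ["UK" if i % 10 else "US" for i in range(10_000)]

even = rows_per_slice(high_cardinality, NUM_SLICES)
hot = rows_per_slice(low_cardinality, NUM_SLICES)

print("high cardinality:", even, "skew", round(skew_ratio(even), 2))
print("low cardinality: ", hot, "skew", round(skew_ratio(hot), 2))
```

With the two-value key, every row ends up on at most two slices and the rest sit idle, which is exactly the "one node doing all the work" problem. The unique key spreads rows almost evenly.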

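The MIN/MAX block pruning and the effect of a date sort key can be sketched in a few lines of Python as well. Again this is only a toy model under my own assumptions: blocks here hold a fixed number of rows (real blocks are sized in bytes), and `build_blocks` and `blocks_to_scan` are hypothetical helpers, not anything from Redshift itself.

```python
import random
from datetime import date, timedelta

BLOCK_SIZE = 100  # rows per block; an illustrative number only


def build_blocks(values, block_size=BLOCK_SIZE):
    """Split values into fixed-size blocks and record (min, max, rows) for each."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append((min(chunk), max(chunk), chunk))
    return blocks


def blocks_to_scan(blocks, lo, hi):
    """Keep only blocks whose [min, max] range overlaps the query range [lo, hi]."""
    return [b for b in blocks if b[1] >= lo and b[0] <= hi]


# 10,000 rows spread over one year, shuffled to mimic unsorted arrival order.
start = date(2023, 1, 1)
dates = [start + timedelta(days=i % 365) for i in range(10_000)]
random.Random(0).shuffle(dates)

unsorted_blocks = build_blocks(dates)
sorted_blocks = build_blocks(sorted(dates))  # roughly what a date sort key gives you

# A query such as: WHERE event_date BETWEEN '2023-03-01' AND '2023-03-07'
lo, hi = date(2023, 3, 1), date(2023, 3, 7)
scan_unsorted = blocks_to_scan(unsorted_blocks, lo, hi)
scan_sorted = blocks_to_scan(sorted_blocks, lo, hi)

print("unsorted:", len(scan_unsorted), "of", len(unsorted_blocks), "blocks scanned")
print("sorted:  ", len(scan_sorted), "of", len(sorted_blocks), "blocks scanned")
```

Both layouts return the same rows for the date range, but in the sorted layout only the handful of blocks whose MIN/MAX range overlaps the query needs to be read. In the shuffled layout almost every block contains dates from across the year, so almost nothing can be skipped.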