Amazon Redshift: Best Practices

Amazon Redshift, initially released in 2012, is a data warehouse product built on PostgreSQL technology and a massively parallel processing (MPP) architecture to handle large-scale data sets and database migrations. Redshift differs from Amazon's other hosted database offerings, such as Amazon RDS and Amazon Aurora, in its ability to handle analytic workloads on big data sets: it stores data on a column-oriented DBMS principle, enabling it to store and analyze petabytes of data at scale. Amazon Redshift provides an open standard JDBC/ODBC driver interface, which allows us to connect our existing business intelligence (BI) tools and reuse existing analytics queries.

While Redshift has been used for almost everything from accelerating existing database environments to ingesting weblogs for big data analytics, or as an OLAP cube, it can run any type of data model: from a production transaction system's third-normal-form model to star and snowflake schemas, data vault, or simple flat tables.

However, realizing the true full potential of this powerful service provided by AWS is quite a challenge. This blog post focuses on a few best practices that can be adhered to in order to maximize and optimize the use of Amazon Redshift.

Load & Store Optimization

  1. Utilizing Materialized Views:

    Materialized views can significantly boost query performance for repeated and predictable analytical workloads, such as queries from BI tools and ELT data transformation queries. Data engineers can easily create and maintain efficient data-processing pipelines with materialized views while seamlessly extending the performance benefits to data analysts and BI tools.

    Materialized views are especially useful for queries that are predictable and repeated over and over. Instead of performing resource-intensive queries against large tables, applications can query the pre-computed data stored in the materialized view. When the data in the base tables changes, the materialized view can be refreshed to incorporate the newly available data by issuing the Amazon Redshift SQL statement REFRESH MATERIALIZED VIEW. Refreshes can be incremental or full (recompute).

    Let's understand the concept with an example:

    Let’s create an example schema to store sales information: each sale transaction, and details about the store where the sale took place.
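    A minimal sketch of the assumed base tables (the column names and types here are illustrative assumptions):

     CREATE TABLE store (
         id    INT,
         city  VARCHAR(64)
     );

     CREATE TABLE sales (
         store_id  INT,
         amount    DECIMAL(12,2)
     );

    To view the total amount of sales per city, we create a materialized view with the CREATE MATERIALIZED VIEW SQL statement (city_sales), joining records from the two tables and aggregating the sales amount (sum(sales.amount)) per city (group by city):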

     CREATE MATERIALIZED VIEW city_sales AS
     (
         SELECT st.city, SUM(sa.amount) AS total_sales
         FROM sales sa, store st
         WHERE sa.store_id = st.id
         GROUP BY st.city
     );
    

    Now we can query the materialized view just like a regular view or table, issuing statements like “SELECT city, total_sales FROM city_sales”. The join between the two tables and the aggregate (sum and group by) are already computed, resulting in significantly less data to scan.

    When the data in the underlying base tables changes, the materialized view doesn’t automatically reflect those changes. You can refresh the data stored in the materialized view on demand with the latest changes from the base tables using the SQL REFRESH MATERIALIZED VIEW command:

     REFRESH MATERIALIZED VIEW city_sales;

  2. Efficient use of Temporary Tables:

    Amazon Redshift provides temporary tables, which act like normal tables but have a lifetime of a single SQL session. The proper use of temporary tables can significantly improve performance of some ETL operations. Unlike regular permanent tables, data changes made to temporary tables don’t trigger automatic incremental backups to Amazon S3, and they don’t require synchronous block mirroring to store a redundant copy of data on a different compute node. Due to these reasons, data ingestion on temporary tables involves reduced overhead and performs much faster. For transient storage needs like staging tables, temporary tables are ideal.

    We can create temporary tables using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. The CREATE TABLE statement gives us complete control over the definition of the temporary table. The SELECT … INTO and CTAS (CREATE TABLE AS SELECT) commands use the input data to determine column names, sizes, and data types, and use default storage properties. By default, for temporary tables, Amazon Redshift applies EVEN table distribution with no column encoding (such as RAW compression) for all columns.

    If we use the SELECT…INTO syntax, we can’t set the column encoding, column distribution, or sort keys. The CREATE TABLE AS (CTAS) syntax instead lets you specify a distribution style and sort keys, and Amazon Redshift automatically applies LZO encoding for everything other than sort keys, Booleans, reals, and doubles. We can also exert additional control by using the CREATE TABLE syntax rather than CTAS.

    When we create temporary tables, it is best to convert all SELECT…INTO syntax into CREATE TABLE statements, as in the sketch below. This ensures that our temporary tables have column encodings and don’t cause distribution errors within the workflow.
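    A minimal sketch, assuming a hypothetical staging table built from the sales table above (the staging-table name and key choices are illustrative):

     -- Instead of: SELECT store_id, amount INTO #stage_sales FROM sales;
     CREATE TEMPORARY TABLE stage_sales
     DISTKEY (store_id)
     SORTKEY (store_id)
     AS
     SELECT store_id, amount
     FROM sales;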

    When we create a temporary staging table by using a CREATE TABLE LIKE statement, the staging table inherits the distribution key, sort keys, and column encodings from the parent target table. In this case, merge operations that join the staging and target tables on the same distribution key perform faster because the joining rows are collocated. To verify that the query uses a collocated join, run the query with EXPLAIN and check for DS_DIST_NONE on all the joins, as shown below.
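    For example, a minimal sketch reusing the hypothetical stage_sales name from above:

     -- The staging table inherits dist key, sort keys, and encodings from sales
     CREATE TEMPORARY TABLE stage_sales (LIKE sales);

     -- Check the plan for DS_DIST_NONE to confirm a collocated join
     EXPLAIN
     SELECT s.store_id, st.amount
     FROM stage_sales st
     JOIN sales s ON s.store_id = st.store_id;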

    We can also analyze statistics on the temporary table, especially when we use it as a join table for subsequent queries.

    Use the following query to analyze and update the table statistics of the temp table:

     ANALYZE my_temp_table;
    

    With this, we retain the functionality of temporary tables along with controlling the data placement on the cluster through distribution key assignment.

  3. Making data loads efficient:

    Amazon Redshift best practices suggest using the COPY command to perform data loads for file-based data. The COPY operation uses all the compute nodes of the Amazon Redshift cluster to load data in parallel, from sources such as Amazon S3 and HDFS file systems on Amazon EMR.
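    A minimal sketch of a parallel load from Amazon S3 (the bucket path and IAM role ARN are hypothetical placeholders):

     COPY sales
     FROM 's3://example-bucket/sales/'
     IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
     FORMAT AS CSV
     GZIP;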

    When performing data loads, it is good practice to compress the data files whenever possible. For row-oriented (CSV) data, Amazon Redshift supports both GZIP and LZO compression; columnar formats such as Parquet and ORC are also supported. It’s more efficient to load a large number of small files than one large file, and the ideal file count is a multiple of the cluster’s total slice count. We achieve the best performance when the compressed files are between 1 MB and 1 GB each.

    The number of slices per node depends on the cluster’s node size (and potentially its elastic resize history). We can query the cluster’s current slice count with:

     SELECT COUNT(*) AS number_of_slices FROM stv_slices;
    

Configurational Optimization

  1. Handling bursts of workload with concurrency scaling and elastic resize

    The legacy, on-premises data warehousing technologies require us to estimate what the system will need 3-4 years in the future, to make sure we’re leasing enough storage, memory, and compute at the time of purchase. In contrast, the ability to resize a cluster allows us to right-size our resources as we go. This is where Amazon Redshift’s elastic resize and concurrency scaling capabilities come into play.

    Elastic resize lets us quickly increase or decrease the number of compute nodes, doubling or halving the original cluster’s node count, or even change the node type. We can expand the cluster to provide additional processing power to accommodate an expected increase in workload. AWS recommends choosing the elastic resize mechanism, but we can use classic resize when resizing to a configuration that isn’t available through elastic resize, such as to or from a single-node cluster. Classic resize is slower but allows us to change the node type or expand beyond the doubling or halving size limitations of an elastic resize.

    Elastic resize completes in minutes and doesn’t require a cluster restart. For anticipated workload spikes that occur on a predictable schedule, we can automate the resize operation using the elastic resize scheduler feature on the Amazon Redshift console, the AWS Command Line Interface (AWS CLI), or the API.
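    For a one-off resize, a minimal AWS CLI sketch (the cluster identifier and node count are hypothetical):

     aws redshift resize-cluster \
         --cluster-identifier example-cluster \
         --number-of-nodes 4 \
         --no-classic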

    Concurrency scaling allows an Amazon Redshift cluster to add capacity dynamically in response to the workload arriving at the cluster. By default, concurrency scaling is disabled; we can enable it for any workload management (WLM) queue to scale to a virtually unlimited number of concurrent queries, with consistently fast query performance.

  2. Leveraging Auto WLM with priorities to increase throughput

    Amazon Redshift runs queries using a queuing system known as workload management (WLM). We can define up to eight queues to separate workloads from each other. Amazon Redshift Advisor automatically analyzes the current WLM usage and can make recommendations to get more throughput from the Redshift cluster. Periodically reviewing the Advisor’s suggestions can help us get the best performance.

    Auto WLM simplifies workload management and maximizes query throughput by using machine learning (ML) to dynamically manage memory and concurrency, ensuring optimal utilization of the cluster’s resources.

    Query priority is a feature of Auto WLM that lets us assign priority ranks to different user groups or query groups, ensuring that higher-priority workloads get more resources for consistent query performance, even during busy times. It is good practice to set up query monitoring rules (QMR) to monitor and manage resource-intensive or runaway queries. QMR also enables us to dynamically change a query’s priority based on its runtime performance and metrics-based rules we define.

    We can also leverage Amazon Redshift’s short query acceleration (SQA), which uses ML to run short-running jobs in their own queue. This keeps small jobs processing, rather than waiting behind longer-running SQL statements. SQA is enabled by default in the default parameter group and for all new parameter groups. When we enable concurrency scaling, Amazon Redshift can automatically and quickly provision additional clusters should the workload begin to back up. This is an important consideration when deciding the cluster’s WLM configuration.

    A common pattern is to optimize the WLM configuration to run most SQL statements without the assistance of supplemental memory, reserving additional processing power for short jobs. Some queueing is acceptable because additional clusters spin up if your needs suddenly expand. To enable concurrency scaling on a WLM queue, set the concurrency scaling mode value to AUTO. You can best inform your decisions by reviewing the concurrency scaling billing model. You can also monitor and control the concurrency scaling usage and cost by using the Amazon Redshift usage limit feature.
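    A minimal sketch of a wlm_json_configuration parameter value tying these pieces together; the queue layout, user group, thresholds, and the change_query_priority rule format are illustrative assumptions, not a verified template:

     [
         {
             "auto_wlm": true,
             "user_group": ["etl_users"],
             "priority": "high",
             "concurrency_scaling": "auto",
             "rules": [
                 {
                     "rule_name": "demote_long_queries",
                     "predicate": [
                         {"metric_name": "query_execution_time", "operator": ">", "value": 300}
                     ],
                     "action": "change_query_priority",
                     "value": "low"
                 }
             ]
         },
         {
             "short_query_queue": true
         }
     ]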

    First, determine if any queries are queuing, using the queuing_queries.sql admin script. Review the maximum concurrency that your cluster needed in the past with wlm_apex.sql, or get an hour-by-hour historical analysis with wlm_apex_hourly.sql. Keep in mind that increasing concurrency allows more queries to run, but each query gets a smaller share of the memory. You may find that by increasing concurrency, some queries must use temporary disk storage to complete, which is also sub-optimal.

Serving Optimization

  1. Amazon Redshift to Amazon S3 data lake integration

    AWS provides awesome integration across its services: Amazon Redshift is tightly integrated with other AWS-native services such as Amazon S3, which is primarily used as a data lake store in the data industry. This lets the Amazon Redshift cluster interact with the data lake in several useful ways.

    Amazon Redshift Spectrum lets us query data directly from files on Amazon S3 through an independent, elastically sized compute layer. Use these patterns independently or apply them together to offload work to the Amazon Redshift Spectrum compute layer, quickly create a transformed or aggregated dataset, or eliminate entire steps in a traditional ETL process.

    Using the Amazon Redshift Spectrum compute layer to offload workloads from the main cluster, and applying more processing power to a specific SQL statement, has become a very common approach among data practitioners. Amazon Redshift Spectrum automatically assigns compute power up to approximately 10 times the processing power of the main cluster. This can be an effective way to quickly process large transform or aggregate jobs. Thus, we can skip the load in an ELT process and run the transform directly against data on Amazon S3: we run transform logic against partitioned, columnar data on Amazon S3 with an INSERT … SELECT statement, as sketched below. It’s easier than going through the extra work of loading a staging dataset, joining it to other tables, and running a transform against it, and it can be used to power real-time analytics.
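    A minimal sketch of such an ELT transform (the external schema, external table, and target table names are hypothetical, assuming the external table is already registered in the data lake catalog):

     INSERT INTO city_sales_summary
     SELECT city, SUM(amount) AS total_sales
     FROM spectrum_schema.sales_external
     GROUP BY city;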

    Using the familiar CREATE EXTERNAL TABLE AS SELECT and INSERT INTO SQL commands, we can create and populate external tables on Amazon S3 for subsequent use by Amazon Redshift or other services participating in the data lake, without the need to manually maintain partitions. For example:
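    A minimal sketch (the schema, table, and bucket names are illustrative placeholders):

     CREATE EXTERNAL TABLE spectrum_schema.sales_by_city
     STORED AS PARQUET
     LOCATION 's3://example-bucket/sales-by-city/'
     AS SELECT city, SUM(amount) AS total_sales
        FROM spectrum_schema.sales_external
        GROUP BY city;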

    Materialized views can also cover external tables, further enhancing the accessibility and utility of the data lake.

    Using the UNLOAD command, Amazon Redshift can export SQL statement output to Amazon S3 in a massively parallel fashion. This technique greatly improves the export performance and lessens the impact of running the data through the leader node. We can further compress the exported data on its way off the Amazon Redshift cluster. As the size of the output grows, so does the benefit of using this feature. For writing columnar data to the data lake, UNLOAD can write partition-aware Parquet data, as sketched below.
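    A minimal sketch of a partition-aware Parquet export (the bucket path and IAM role ARN are hypothetical):

     UNLOAD ('SELECT city, total_sales FROM city_sales')
     TO 's3://example-bucket/unload/city_sales_'
     IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
     FORMAT AS PARQUET
     PARTITION BY (city);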

  2. Federated queries, connecting the OLAP, OLTP, and data lake universes:

    The Federated Query feature allows Amazon Redshift to run analytics directly against live data residing in our OLTP source system databases and the Amazon S3 data lake, without the overhead of performing ETL to ingest source data into Amazon Redshift tables. This feature gives us a convenient and efficient option for providing real-time data visibility on operational reports, as an alternative to micro-ETL batch ingestion of real-time data into the data warehouse. By combining historical trend data from the data warehouse with live developing trends from the source systems, we can gather valuable insights to drive real-time business decision making.
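    A minimal sketch, assuming an Aurora PostgreSQL source (the endpoint, database, table names, IAM role, and secret ARN are hypothetical placeholders):

     CREATE EXTERNAL SCHEMA postgres_fed
     FROM POSTGRES
     DATABASE 'ordersdb' SCHEMA 'public'
     URI 'example-aurora.cluster-abc123.us-east-1.rds.amazonaws.com'
     IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
     SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:example-pg-secret';

     -- Join live OLTP data with warehouse data in a single query
     SELECT o.order_id, c.total_sales
     FROM postgres_fed.orders o
     JOIN city_sales c ON c.city = o.city;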