Migrating Data Between Amazon Redshift Databases
There are many use cases for migrating data from one Amazon Redshift database to another. In this post, let’s see how we can refresh data in a test Amazon Redshift cluster from a production Amazon Redshift cluster on a scheduled basis using AWS Lambda or AWS Batch. Since data in the test database often gets modified during test runs, we need to refresh it before each test run or at regular intervals.
Amazon Redshift provides the UNLOAD command to unload the result of a query to an Amazon S3 bucket, and the COPY command to load data from an Amazon S3 bucket into a table. This post explains how to automate data migration between Amazon Redshift databases using the UNLOAD and COPY commands and an Amazon S3 bucket.
Let’s assume company X runs an Amazon Redshift cluster in both a production AWS account and a test AWS account, ingesting historical data from different sources into tables in a Redshift database schema to provide data analytics solutions. The testing team doesn’t have access to the production environment, and they need the latest production data in the test environment for testing.
In this case, we need to migrate the data in all the tables of the Redshift database schema from the production to the test environment. Here are the step-by-step instructions for this task.
1. Create an Amazon S3 bucket in the production AWS account to hold the data unloaded from Redshift, and attach the following bucket policy. This policy grants access to the test AWS account so the unloaded data is accessible for copying to the Redshift cluster in that account.
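A bucket policy along these lines would grant the test account read access; the bucket name `example-unload-bucket` and the test account ID `111122223333` are placeholders you must replace with your own values:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowTestAccountRead",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::example-unload-bucket",
        "arn:aws:s3:::example-unload-bucket/*"
      ]
    }
  ]
}
```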
2. Create an IAM role with the following IAM policy.
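Since UNLOAD writes objects to S3, the role's policy needs write access to the bucket created in step 1. A minimal sketch (the bucket name is a placeholder; the role must also have a trust relationship allowing `redshift.amazonaws.com` to assume it):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::example-unload-bucket",
        "arn:aws:s3:::example-unload-bucket/*"
      ]
    }
  ]
}
```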
3. Associate the IAM role with the Redshift cluster in the production account. (Open the Redshift console, choose Clusters, then Manage IAM roles, select the role from the list, and add it to Associated IAM roles.)
4. Use the following Python script to unload data from all the tables in a schema. You can run the script in your local environment, or in AWS as a Lambda function or an AWS Batch job.
Before running it, make sure you have set the following environment variables with appropriate values. The script first executes a query to select all the tables in the given schema, then runs an UNLOAD command for each table with the ALLOWOVERWRITE option. (If the same file already exists in the S3 bucket, it will be overwritten.)
- IAM_ROLE_ARN (IAM role ARN created in step 2)
- S3_BUCKET_ARN (bucket ARN created in step 1)
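A minimal sketch of such an unload script is shown below. It assumes the `psycopg2` driver and connection settings in `REDSHIFT_HOST`, `REDSHIFT_DB`, `REDSHIFT_USER`, `REDSHIFT_PASSWORD`, and `SCHEMA_NAME` environment variables (names chosen for illustration); `IAM_ROLE_ARN` and `S3_BUCKET_ARN` are the variables described above.

```python
import os


def build_unload_sql(schema, table, bucket, iam_role_arn):
    """Build an UNLOAD statement that writes one table to S3 as CSV.

    ALLOWOVERWRITE replaces any existing files at the same S3 prefix.
    """
    return (
        f"UNLOAD ('SELECT * FROM {schema}.{table}') "
        f"TO 's3://{bucket}/{schema}/{table}/' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV ALLOWOVERWRITE"
    )


def unload_schema(conn, schema, bucket, iam_role_arn):
    """Unload every base table in the schema; conn is a DB-API connection."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = %s AND table_type = 'BASE TABLE'",
            (schema,),
        )
        tables = [row[0] for row in cur.fetchall()]
        for table in tables:
            cur.execute(build_unload_sql(schema, table, bucket, iam_role_arn))
    conn.commit()


if __name__ == "__main__":
    import psycopg2  # assumed driver; redshift_connector would also work

    # UNLOAD's TO clause takes a bucket name, so derive it from the ARN.
    bucket = os.environ["S3_BUCKET_ARN"].split(":::")[-1]
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=int(os.environ.get("REDSHIFT_PORT", "5439")),
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    unload_schema(conn, os.environ["SCHEMA_NAME"], bucket,
                  os.environ["IAM_ROLE_ARN"])
    conn.close()
```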
After the script runs successfully, you can find the unloaded data as CSV files in the S3 location. Now you can copy the unloaded data to the Redshift cluster in the test environment by performing the following tasks.
1. Create an IAM role with the following IAM policy in the test AWS account.
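Since COPY reads objects from S3, this role's policy needs read access to the production bucket. A minimal sketch (the bucket name is a placeholder; the role must also trust `redshift.amazonaws.com`):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::example-unload-bucket",
        "arn:aws:s3:::example-unload-bucket/*"
      ]
    }
  ]
}
```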
2. Associate the IAM role with the Redshift cluster in the test account. (Open the Redshift console, choose Clusters, then Manage IAM roles, select the role from the list, and add it to Associated IAM roles.)
3. Use the following Python script to copy data from the S3 bucket into the tables in the schema. You can run the script in your local environment, or in AWS as a Lambda function or an AWS Batch job.
This script truncates each table before inserting data into it. Make sure all environment variables are set with the right values.
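A minimal sketch of the copy script, mirroring the unload script above: it truncates each table and then runs COPY from the matching S3 prefix. The same assumptions apply (`psycopg2` driver, illustrative `REDSHIFT_*` and `SCHEMA_NAME` environment variable names).

```python
import os


def build_copy_sql(schema, table, bucket, iam_role_arn):
    """Build a COPY statement that loads one table from its S3 prefix."""
    return (
        f"COPY {schema}.{table} "
        f"FROM 's3://{bucket}/{schema}/{table}/' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV"
    )


def copy_schema(conn, schema, bucket, iam_role_arn):
    """Truncate and reload every base table in the schema."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = %s AND table_type = 'BASE TABLE'",
            (schema,),
        )
        tables = [row[0] for row in cur.fetchall()]
        for table in tables:
            # Clear stale test data before loading the fresh unload files.
            cur.execute(f"TRUNCATE TABLE {schema}.{table}")
            cur.execute(build_copy_sql(schema, table, bucket, iam_role_arn))
    conn.commit()


if __name__ == "__main__":
    import psycopg2  # assumed driver; redshift_connector would also work

    bucket = os.environ["S3_BUCKET_ARN"].split(":::")[-1]
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=int(os.environ.get("REDSHIFT_PORT", "5439")),
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    copy_schema(conn, os.environ["SCHEMA_NAME"], bucket,
                os.environ["IAM_ROLE_ARN"])
    conn.close()
```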
This post describes how to migrate data between Amazon Redshift clusters in different AWS accounts using Python scripts. The scripts migrate all the data in a schema in a single run, and you can run them in your local environment or in the AWS cloud. If you have a small amount of data and the script finishes in under 15 minutes, you can run it as a Lambda function. If you have massive data and the script takes longer than 15 minutes, use AWS Batch to execute it instead.