Jul 13, 2017

How to Move Your Data from S3 to Redshift

INAP

Amazon Redshift is a powerful, fully managed data warehouse solution from AWS. It makes it easy and cost-effective to analyze your data using standard Business Intelligence tools. But before we get into what Redshift can do for you, it is important to say what it can't, or rather, shouldn't do for you. Redshift is not built to be an alternative to a standard relational database (MySQL, Postgres, etc.) and is not particularly good at:

  1. On demand, real-time updates/deletes
  2. Answering tens of thousands of concurrent queries in milliseconds
  3. Searching specific values by a pre-defined index
  4. Other, non-OLAP use cases

The key point to remember here: Redshift was built for Business Intelligence workloads, so its query concurrency is limited. Rather than trickling in single rows, it is wise to ingest data into the system in batches.

So in this post we are going to explore a simple example of getting your data into Redshift. In short, we'll set up a basic EC2 instance running an FTP server (vsftpd) that will allow users to upload the data they want to put into Redshift. From there, we'll transfer the data from the EC2 instance to an S3 bucket, and finally, into our Redshift cluster.

Create S3 Bucket

Let's start by creating the S3 bucket. After logging into your AWS account, head to the S3 console and select "Create Bucket."
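If you prefer the command line, the same bucket can be created with the AWS CLI (the bucket name and region here are simply the examples used throughout this post):

# create the bucket that will receive the uploaded data
aws s3 mb s3://redshifts3bucket01 --region us-east-1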

Create IAM Policy

Next we are going to create an IAM policy that allows our EC2 instance to upload to the S3 bucket. We do this by going to the IAM section of the AWS console and clicking on the "Policies" link.

Next, we create a custom policy and choose the following:

  •       AWS Service: Amazon S3
  •       Actions: PutObject, PutObjectAcl, ListAllMyBuckets
  •       Amazon Resource Name (ARN): arn:aws:s3:::redshifts3bucket01 (the ARN will always be arn:aws:s3:::BUCKETNAME)

 

Next we will name the policy and set a description. We’ll also be able to view the JSON version of the policy, which will look something like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1498839988000",
      "Effect": "Allow",
      "Action": [
        "s3:ListAllMyBuckets",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::redshifts3bucket01"
      ]
    }
  ]
}

From here, we need to create a role. In the IAM console, click "Roles" and then click "Create New Role." Select the Amazon EC2 option and then type in the first few letters of what you named your policy. Click the checkbox next to the policy name.

Click "Next," name your role, and click "Create."
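The console handles the EC2 trust relationship and instance profile for you. If you ever need to script this step, the rough CLI equivalent looks like the following; the role and policy names, account ID, and trust-policy file are placeholders for whatever you created above:

# create the role with a trust policy that lets EC2 assume it
aws iam create-role --role-name RedshiftS3UploadRole \
  --assume-role-policy-document file://ec2-trust-policy.json

# attach the S3 upload policy created earlier
aws iam attach-role-policy --role-name RedshiftS3UploadRole \
  --policy-arn arn:aws:iam::123456789012:policy/RedshiftS3UploadPolicy

# EC2 picks up roles through an instance profile
aws iam create-instance-profile --instance-profile-name RedshiftS3UploadRole
aws iam add-role-to-instance-profile --instance-profile-name RedshiftS3UploadRole \
  --role-name RedshiftS3UploadRole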

Spin Up EC2 Instance

Now we are going to spin up our EC2 instance. Head to the EC2 section of the AWS console and choose to launch an instance. Select the Amazon Linux AMI and the t2.micro free tier.

On the “Configure Instance Details” page, specify your VPC and subnet and also select the option for assigning a public IP. Be sure to select the role you just created for the instance in the “IAM role” field.

You can leave the defaults in place for the "Storage" and "Tags" pages or make any changes you would like. On the "Configure Security Group" page, we will need to create a new group and allow the following:

  •       SSH, Port 22 – 0.0.0.0/0
  •       FTP, Port 21 – 0.0.0.0/0 (the FTP control connection)
  •       Custom TCP, Ports 1024-1048 – 0.0.0.0/0 (the passive FTP port range we will configure later)

Click "Review" and then "Launch," at which point you will need to either use an existing keypair or create a new one.
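For reference, a scripted version of this launch might look roughly like this; every ID below is a placeholder, and the instance profile name is whatever you called the role in the previous step:

aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type t2.micro \
  --key-name my-keypair \
  --subnet-id subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx \
  --associate-public-ip-address \
  --iam-instance-profile Name=RedshiftS3UploadRole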

Create Redshift Cluster

While waiting for the instance to spin up, we will create our Redshift cluster. Go to the Redshift page in the AWS console.

First, if you are not using the default VPC and do not have a subnet group, you will need to create one. You can do so by clicking "Security" on the Redshift console page and following the steps.

Now click “Launch Cluster” on the Redshift console page:

Provide the cluster details:

  •       Cluster Identifier: redshifts3
  •       Database name: redshifts3
  •       Port: 5439
  •       Master User name: masteruser
  •       Password: ##########

Next, select the size of the node and other cluster parameters. In this case, let's just leave the defaults in place.

Continue to the review page and then click “Launch Cluster.”

Next, choose the correct options for your VPC, subnet group and security group.
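If you would rather script the cluster creation, the CLI equivalent is along these lines (the node type is just an example; the password is whatever you chose above):

aws redshift create-cluster \
  --cluster-identifier redshifts3 \
  --db-name redshifts3 \
  --port 5439 \
  --master-username masteruser \
  --master-user-password 'YourPasswordHere' \
  --node-type dc1.large \
  --cluster-type single-node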

Configure EC2 Instance

Next up, it's time to log in to and configure the EC2 instance we just created. Connect to the instance using SSH; once logged in, type the following commands:

sudo -i

yum update -y

yum install -y vsftpd

chkconfig --level 345 vsftpd on

vi /etc/vsftpd/vsftpd.conf

Change:

anonymous_enable=YES

to

anonymous_enable=NO

Uncomment:

chroot_local_user=YES

And add the following to the bottom of the file:

pasv_enable=YES

pasv_min_port=1024

pasv_max_port=1048

pasv_address=[YOURPUBLICIP]

Now save the file and restart vsftp.

/etc/init.d/vsftpd restart

Now create an FTP user and set its password (passwd will prompt you to enter the password twice):

adduser redshifts3

passwd redshifts3
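At this point it is worth a quick sanity check from your own machine, assuming you have a command-line FTP client installed:

# connect to the instance's public IP and log in as the user created above
ftp YOURPUBLICIP
# once authenticated, 'put somefile.csv' should land the file in /home/redshifts3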

Now we are going to install s3cmd, which will let us easily sync folders to our S3 bucket. Enter the following commands:

wget http://ufpr.dl.sourceforge.net/project/s3tools/s3cmd/1.6.1/s3cmd-1.6.1.tar.gz

tar xzf s3cmd-1.6.1.tar.gz

cd s3cmd-1.6.1

sudo python setup.py install

/usr/local/bin/s3cmd --configure

Then enter the config information:

Enter new values or accept defaults in brackets with Enter.
Refer to user manual for detailed description of all options.

Access key and Secret key are your identifiers for Amazon S3

Access Key: xxxxxxxxxxxxxxxxxxxxxx

Secret Key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Encryption password is used to protect your files from reading

by unauthorized persons while in transfer to S3

Encryption password: xxxxxxxxxx

Path to GPG program [/usr/bin/gpg]:

When using secure HTTPS protocol all communication with Amazon S3

servers is protected from 3rd party eavesdropping. This method is

slower than plain HTTP and can’t be used if you’re behind a proxy

Use HTTPS protocol [No]: Yes

New settings:

 Access Key: xxxxxxxxxxxxxxxxxxxxxx

 Secret Key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

 Encryption password: xxxxxxxxxx

 Path to GPG program: /usr/bin/gpg

 Use HTTPS protocol: True

 HTTP Proxy server name:

 HTTP Proxy server port: 0

Test access with supplied credentials? [Y/n] Y

Please wait, attempting to list all buckets…

Success. Your access key and secret key worked fine :-)

Now verifying that encryption works…

Success. Encryption and decryption worked fine :-)

Save settings? [y/N] y

Configuration saved to '/root/.s3cfg'
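Before wiring up the final script, a quick test confirms s3cmd can reach the bucket (the bucket name is the one created at the start of this post):

# list the buckets visible to the configured keys
/usr/local/bin/s3cmd ls

# push a throwaway file into the bucket
echo "test" > /tmp/testfile.txt
/usr/local/bin/s3cmd put /tmp/testfile.txt s3://redshifts3bucket01/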

Last, we just need to set up a script that copies the data into Redshift. The load itself is done with a SQL COPY command run against the cluster:

copy customer
from 's3://mybucket/mydata'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
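Putting the pieces together, a minimal sketch of the sync-and-load script might look like this. It assumes psql is installed on the instance, the cluster endpoint is reachable from it, and that the role referenced by the COPY command is attached to the Redshift cluster with read access to the bucket; the endpoint, table name and paths are all placeholders:

#!/bin/bash
# sync whatever users have uploaded over FTP to the S3 bucket
/usr/local/bin/s3cmd sync /home/redshifts3/ s3://redshifts3bucket01/

# load the files from S3 into the target Redshift table
PGPASSWORD='##########' psql \
  -h redshifts3.xxxxxxxxxx.us-east-1.redshift.amazonaws.com \
  -p 5439 -U masteruser -d redshifts3 \
  -c "copy customer from 's3://redshifts3bucket01/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';"

Run it by hand or drop it into cron on whatever schedule suits your batch ingestion.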

There you have it. While this may seem like a lot, the whole process of moving your data from S3 to Redshift is fairly straightforward. Once you’re up and running, I have no doubt that you’ll find Amazon Redshift to be a reliable data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL along with your existing Business Intelligence (BI) tools. Any questions about these steps? Get in touch.
