AWS Data Pipeline: Load Data from S3 into Redshift, Plus a Couple of Useful Settings (Compression, Schema, VPC)

Data Pipelines are pretty cool, but there are some ‘nerd knobs’ that can be hard to find. I’ll show a few we have discovered. Credit to Javier Murrieta for the first two!

One quick way to get your feet wet is with the ‘Load data from S3 into Redshift’ template.

A common pattern is doing a Redshift ‘upsert’ using a staging table:

https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html
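The linked doc boils down to loading the new rows into a staging table and then merging them into the target inside one transaction. Here is a minimal sketch of that merge run with psycopg2 outside of Data Pipeline; the cluster endpoint, credentials, and the target/stage table and id column names are placeholders, not anything from the template.

```python
import psycopg2

# Placeholder endpoint and credentials; swap in your own cluster details.
conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",
)

try:
    with conn.cursor() as cur:
        # Assumes the new rows were already COPYed from S3 into the stage table.
        # Remove target rows that the staged rows will replace...
        cur.execute("DELETE FROM target USING stage WHERE target.id = stage.id;")
        # ...then insert every staged row (updated and brand-new rows alike).
        cur.execute("INSERT INTO target SELECT * FROM stage;")
    conn.commit()  # both statements commit as a single transaction
except Exception:
    conn.rollback()
    raise
finally:
    conn.close()
```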

After creating this pipeline, you might be wondering how to ingest a gzip-compressed file.

To do so, click into your S3InputDataNode and choose ‘Add an optional field’.

Choose Compression and enter GZIP.
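If you would rather script the change than click through the console, the same field can be set in the pipeline definition. This is a minimal sketch using boto3’s put_pipeline_definition; the pipeline ID, bucket path, and node name are placeholders, and in a real call the node would sit alongside the rest of the template’s objects.

```python
import boto3

dp = boto3.client("datapipeline")

# The S3 input node from the template, with the compression field added.
# Everything except the "compression" field is a placeholder.
s3_input_node = {
    "id": "S3InputDataNode",
    "name": "S3InputDataNode",
    "fields": [
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://my-bucket/input/"},
        # Equivalent of choosing Compression -> GZIP in the console.
        {"key": "compression", "stringValue": "gzip"},
    ],
}

dp.put_pipeline_definition(
    pipelineId="df-0123456789ABCDEFGHIJ",  # placeholder pipeline ID
    pipelineObjects=[s3_input_node],        # plus the rest of the definition
)
```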

Another useful setting is changing from the default schema.

This lives on the RedshiftDataNode (DestRedshiftTable in the template).
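Expressed as a pipeline object, the console option maps (as far as I can tell) to a schemaName field on the Redshift data node. A minimal sketch in the same boto3 field format as above; the table, schema, and database reference are placeholders.

```python
# The Redshift destination node from the template, pointed at a non-default schema.
dest_redshift_table = {
    "id": "DestRedshiftTable",
    "name": "DestRedshiftTable",
    "fields": [
        {"key": "type", "stringValue": "RedshiftDataNode"},
        {"key": "tableName", "stringValue": "events"},          # placeholder table
        {"key": "schemaName", "stringValue": "analytics"},      # schema other than the default
        {"key": "database", "refValue": "RedshiftDatabaseId"},  # reference to the RedshiftDatabase object
    ],
}
```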

The final thing I wanted to do was connect to a Redshift cluster inside a VPC. This article explains how to do it in the pipeline config:

https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-resources-vpc.html

In the UI, though, it is located under Ec2Resource: you have to add the subnet ID and security group ID.
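In pipeline-definition terms, that means adding subnetId and securityGroupIds to the Ec2Resource object. A minimal sketch, again with placeholder IDs; the security group needs to be one that the cluster’s security group allows in.

```python
# Ec2Resource pinned to a VPC subnet so the task runner can reach the private cluster.
ec2_resource = {
    "id": "Ec2Instance",
    "name": "Ec2Instance",
    "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "instanceType", "stringValue": "t2.micro"},        # placeholder instance type
        {"key": "subnetId", "stringValue": "subnet-0abc1234"},     # subnet inside your VPC
        {"key": "securityGroupIds", "stringValue": "sg-0def5678"}, # SG allowed by the cluster's SG
        {"key": "terminateAfter", "stringValue": "1 Hour"},
    ],
}
```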

There you have it: a couple of useful settings.
