Whenever you start a long-term ML project, a solid database and data warehouse setup is a must. So here is a tutorial on building a Redshift data warehouse with AWS Glue; later, you can connect Redshift to SageMaker for your ML projects!
Redshift is a fully managed, petabyte-scale data warehouse in the cloud. Regardless of the size of the data set, Amazon Redshift offers fast query performance using the same SQL-based tools and business intelligence applications that you use today. It is based on PostgreSQL, a relational database management system (RDBMS) designed to handle a range of workloads, from single machines to data warehouses. If you would like to know more about the types of databases, feel free to read here.
Glue is a managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. Amazon has recently launched Glue version 2.0:
AWS Glue version 2.0 is now generally available and features Spark ETL jobs that start 10x faster. This reduction in startup latencies reduces overall job completion times, supports customers with micro-batching and time-sensitive workloads, and increases business productivity by enabling interactive script development and data exploration.
Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. It provides a Jupyter environment where you can easily run your ML experiments.
1. S3 bucket
Of course, you will need to upload your data to an S3 bucket first for storage. After creating the S3 bucket, I advise you to put each dataset in a separate folder, to ease file reading in AWS Glue and to avoid errors when creating the Redshift tables.
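For illustration, here is a minimal boto3 sketch of that layout (the bucket and file names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-ml-project-bucket"  # hypothetical bucket name

# One folder (prefix) per dataset, so the Glue crawler can infer a
# clean, separate schema for each future Redshift table.
s3.upload_file("orders.csv", bucket, "orders/orders.csv")
s3.upload_file("customers.csv", bucket, "customers/customers.csv")
```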
2. IAM Roles
Create an `AWSGlueServiceRole` that has the necessary policies attached (typically the `AWSGlueServiceRole` managed policy plus S3 access to your data bucket).
Later, you will want to create a notebook that connects with the Glue development endpoint, so make sure to use `AWSGlueServiceSagemakerNotebookRole-SagemakerGlue` for the SageMaker endpoint role.
3. Redshift cluster
Create a cluster from the dashboard, then set the database name and password. Choose an appropriate cluster size. If you need multiple concurrent jobs running, I advise increasing the number of nodes and vCPUs; it is a trade-off between processing time and cost. The smallest cluster might seem cheaper, but when a long-running job could be shortened by adding vCPUs and nodes, you end up paying about the same as you would for a bigger cluster.
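If you prefer the API over the dashboard, a cluster can also be created with boto3. A minimal sketch, with placeholder names and credentials:

```python
import boto3

redshift = boto3.client("redshift")

# Node type and count are the processing-time vs. cost trade-off
# discussed above; all values here are placeholders.
redshift.create_cluster(
    ClusterIdentifier="my-warehouse",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=2,
    DBName="mydb",
    MasterUsername="awsuser",
    MasterUserPassword="REPLACE_ME",
)
```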
You may then run a query from the `editor` tab by connecting to the database you created just now. Standard SQL queries are supported here. Note that your tables will be stored in the `public` schema by default. You may also want to view your DB without opening Redshift in the browser: remove the hassle by connecting to the database from any database-browsing app, or simply connect with PyCharm. I use the free version of PyCharm for my coding work.
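Since Redshift speaks the PostgreSQL wire protocol, you can also query it from Python with psycopg2. A minimal sketch with placeholder connection details:

```python
import psycopg2

# Host, database, and credentials are placeholders; copy the real
# hostname from your cluster's endpoint in the Redshift console.
conn = psycopg2.connect(
    host="my-warehouse.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="awsuser",
    password="REPLACE_ME",
)
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM public.orders;")
    print(cur.fetchone())
conn.close()
```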
- Create a `Crawler` to crawl the data.
- Create the `Crawler` by following the steps in the wizard. You may set the `crawler source type` to the S3 bucket path that contains your data. Set the schedule to `Run on demand`. For the `output` database, you may need to create one if you have none yet. The `Connections` detail should be the same as the cluster created in Redshift; for the `connection type`, you may use `JDBC`.
- Check the crawled data in the `Tables` tab. Ensure that Glue has successfully crawled the data and stored the table metadata there (a sketch of the equivalent API calls follows below).
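The same crawler setup can be scripted with boto3. A minimal sketch, assuming the role created above; the crawler name, catalog database, and S3 path are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# "Run on demand" means we pass no Schedule argument and start
# the crawler manually instead.
glue.create_crawler(
    Name="orders-crawler",
    Role="AWSGlueServiceRole",
    DatabaseName="my_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://my-ml-project-bucket/orders/"}]},
)
glue.start_crawler(Name="orders-crawler")
```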
Next, click `Add job` to create a new job for Glue. Choose the same `IAM Role` that we used before.
- If you are going to create multiple dynamic frames inside the job, change `max concurrency` to 5, 10, or 15 (depending on how many dynamic frames you are going to create).
- The data source is the database table crawled by the Crawler. Choose `Change schema` as the transform type.
- When choosing a data target, choose `create tables in your data target`. Use `JDBC` for the required Redshift connection.
- Check the mapping of all source columns to target columns. Replace any dot with an underscore (`_`). A column name containing a `.` needs to have a backtick (`) at the beginning and at the end.
- Relationalize the data: follow this tutorial to relationalize the nested dictionary. Note that the mapping style is case sensitive. For example, in `order.stock.shippingOptions`, `order` is the name of the root table, `stock` is the column, each `.` represents a level of the dict, and `shippingOptions` is nested under the `stock` column. A sketch of a full job script, including the relationalize step, follows this list.
- To create the development `endpoints`, we need to use the same `Networking`: click on `choose a connection`, then choose the Redshift cluster associated with the database. You may skip the `SSH public key`.
- To create a notebook connected with the endpoint above, you should use the `AWSGlueServiceSagemakerNotebookRole-SagemakerGlue` role mentioned earlier.
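Putting the job steps together, here is a minimal sketch of what such a Glue job script can look like. The catalog database, table, connection, column, and S3 names are all hypothetical, and the script Glue generates for you will differ in its details:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, Relationalize
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Data source: the table the crawler stored in the Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db", table_name="orders"
)

# Relationalize the nested dictionary; the root table name is
# case sensitive, as noted above.
collection = Relationalize.apply(
    frame=orders,
    staging_path="s3://my-ml-project-bucket/tmp/",
    name="order",
)
flat = collection.select("order")

# "Change schema" transform: map source to target columns; dotted
# source names are wrapped in backticks, dots become underscores.
mapped = ApplyMapping.apply(
    frame=flat,
    mappings=[
        ("orderid", "string", "orderid", "string"),
        ("`stock.shippingOptions`", "string", "stock_shippingoptions", "string"),
    ],
)

# Data target: create tables in Redshift through the JDBC connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.orders", "database": "mydb"},
    redshift_tmp_dir="s3://my-ml-project-bucket/tmp/redshift/",
)

job.commit()
```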
`Workflows` will be the final step to schedule the `crawler` and run the `job`. Click `add workflow` to add a new one, then add a `trigger` by clicking `add new`. For the trigger, use `start after ALL watched event` as the trigger condition.
- At the right of the trigger for the `crawler`, add the required `job` to run.
- Follow this link to read the Redshift data in the SageMaker notebook.
- The details for the Redshift cluster are in this JSON file. The hostname is retrieved from the cluster's endpoint shown in the Redshift console.
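In the SageMaker notebook itself, one simple way (among others) to pull Redshift data into a DataFrame is psycopg2 plus pandas. A sketch with placeholder connection details:

```python
import pandas as pd
import psycopg2

# Placeholders; the hostname comes from the cluster's endpoint.
conn = psycopg2.connect(
    host="my-warehouse.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="awsuser",
    password="REPLACE_ME",
)
# pandas warns about non-SQLAlchemy connections, but this works
# fine for quick exploratory reads.
df = pd.read_sql("SELECT * FROM public.orders LIMIT 1000;", conn)
conn.close()
df.head()
```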
If you need to update your Redshift database, you may do so by following this tutorial. You can also use the `job bookmark` feature in Glue to avoid duplicates and to let Glue track processed data.
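Job bookmarks are switched on through a job argument. A minimal boto3 sketch, with hypothetical job and script names:

```python
import boto3

glue = boto3.client("glue")

# "--job-bookmark-option": "job-bookmark-enable" makes Glue track
# processed data between runs and skip it next time.
glue.create_job(
    Name="orders-to-redshift",
    Role="AWSGlueServiceRole",
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-ml-project-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```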
This tutorial is based on my experience transferring data from an S3 bucket to Amazon Redshift using AWS Glue. Do contact me if you need help doing this or if you want to suggest another way 😀.