Project Background

For this project, the fictional company Sparkify has outgrown the databases created in the PostgreSQL-Data-Modeling project and now needs a more robust database to store its ever-increasing data. The goal of this project is to create a database inside AWS Redshift and populate it from log files stored in an S3 bucket. This project also introduces the concept of infrastructure as code: part of the project entails creating a Python script to launch a new AWS Redshift cluster.
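To give a feel for what infrastructure as code looks like here, below is a minimal sketch of launching a Redshift cluster with boto3. Every identifier, credential, and parameter is a placeholder; the actual setup script reads these values from a config file rather than hard-coding them.

```python
import boto3

# Minimal infrastructure-as-code sketch; all values below are placeholders.
redshift = boto3.client(
    "redshift",
    region_name="us-west-2",
    aws_access_key_id="YOUR_KEY",
    aws_secret_access_key="YOUR_SECRET",
)

redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dwh",
    ClusterIdentifier="dwhCluster",
    MasterUsername="dwhuser",
    MasterUserPassword="Passw0rd",
    IamRoles=["arn:aws:iam::123456789012:role/dwhRole"],  # role with S3 read access
)
```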

The GitHub repository for this project is located here: Redshift-Data-Warehouse-Project

Project Datasets

As with the PostgreSQL-Data-Modeling project, there are two datasets for this project: the song data and the log data. Both datasets are stored in S3 buckets provided by Udacity.

Song Dataset

Each song in the song dataset is stored as a separate JSON file. These JSON files are partitioned in S3 by the first three letters of each song's track ID. Below is an example provided by Udacity:

```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```

An example of the JSON format for each song is:

```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
```

The data schema for this JSON is:

| Field | Data Type |
| --- | --- |
| num_songs | INT |
| artist_id | VARCHAR |
| artist_latitude | REAL |
| artist_longitude | REAL |
| artist_location | VARCHAR |
| artist_name | VARCHAR |
| song_id | VARCHAR |
| title | VARCHAR |
| duration | NUMERIC |
| year | INT |

Log Dataset

The log files for each day are stored in a JSON file. These JSON files are partitioned in S3 by year and month. An example of this format is:

```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```

Each JSON file is actually JSON Lines formatted: each line of the file is a valid JSON object, but the file as a whole is not valid JSON. To avoid errors while reading these files into Python, it is recommended to use pandas with the format shown below.
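A minimal sketch of that read, assuming a local copy of one of the log files shown above; the key detail is lines=True:

```python
import pandas as pd

# lines=True parses each line as a separate JSON object,
# which is required for JSON Lines files.
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)
```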

A sample of this log data in a DataFrame is shown below: Log File Example

The data schema for the log dataset is:

| Field | Data Type |
| --- | --- |
| artist | VARCHAR |
| auth | VARCHAR |
| firstName | VARCHAR |
| gender | VARCHAR |
| itemInSession | INT |
| lastName | VARCHAR |
| length | REAL |
| level | VARCHAR |
| location | VARCHAR |
| method | VARCHAR |
| page | VARCHAR |
| registration | NUMERIC |
| sessionId | INT |
| song | VARCHAR |
| status | INT |
| ts | TIMESTAMP |
| userAgent | VARCHAR |
| userId | INT |

Final Database Schema

Much like the PostgreSQL-Data-Modeling Project, the primary data model for this project is a star schema. However, instead of reading data directly from the S3 buckets into the final database, this project makes use of staging tables that act as an intermediary between the S3 bucket and the final database.

The Staging Tables

There are two staging tables: staging_events and staging_songs. These tables temporarily hold data from the S3 bucket before it is transformed and inserted into the primary use tables.

The staging_songs table contains:

| Field | Data Type |
| --- | --- |
| artist_id | VARCHAR |
| artist_latitude | NUMERIC |
| artist_location | VARCHAR |
| artist_longitude | NUMERIC |
| artist_name | VARCHAR |
| duration | FLOAT |
| num_songs | INTEGER |
| song_id | VARCHAR |
| title | VARCHAR |
| year | INTEGER |

The staging_events table contains:

| Field | Data Type |
| --- | --- |
| artist | VARCHAR |
| auth | VARCHAR |
| first_name | VARCHAR |
| gender | VARCHAR |
| item_in_session | INTEGER |
| last_name | VARCHAR |
| length | NUMERIC |
| level | VARCHAR |
| location | VARCHAR |
| method | VARCHAR |
| page | VARCHAR |
| registration | BIGINT |
| session_id | INTEGER |
| song | VARCHAR |
| status | INTEGER |
| ts | TIMESTAMP |
| user_agent | VARCHAR |
| user_id | INTEGER |
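To make the loading step concrete, here is a hedged sketch of populating staging_events with psycopg2 and a Redshift COPY command. The bucket path, JSONPaths file, IAM role ARN, and connection values are all placeholders; in the project they come from dwh.cfg.

```python
import psycopg2

# Connection values are placeholders; the project reads them from dwh.cfg.
conn = psycopg2.connect(
    host="<cluster-endpoint>", dbname="dwh",
    user="dwhuser", password="Passw0rd", port=5439,
)
cur = conn.cursor()

# COPY pulls the JSON log files from S3 straight into the staging table.
# The bucket, JSONPaths file, and IAM role ARN below are illustrative.
cur.execute("""
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/dwhRole'
    JSON 's3://udacity-dend/log_json_path.json'
    REGION 'us-west-2';
""")
conn.commit()
```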

The use tables are the songplay_fact, time_dim, user_dim, song_dim, and artist_dim tables. Together these tables form the star schema of the final database.

The time_dim table contains:

| Field | Data Type | Key | Dist/Sort Key |
| --- | --- | --- | --- |
| start_time | TIMESTAMP | Primary | SORTKEY/DISTKEY |
| hour | INTEGER | | |
| day | INTEGER | | |
| week | INTEGER | | |
| month | INTEGER | | |
| year | INTEGER | | |
| weekday | INTEGER | | |
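As a sketch of how the Key and Dist/Sort Key columns translate into Redshift DDL, written as a Python query string in the style create_tables.py might use:

```python
# Sketch of the Redshift DDL implied by the table above; column-level
# SORTKEY and DISTKEY attributes are valid Redshift syntax.
create_time_dim = """
    CREATE TABLE IF NOT EXISTS time_dim (
        start_time TIMESTAMP PRIMARY KEY SORTKEY DISTKEY,
        hour       INTEGER,
        day        INTEGER,
        week       INTEGER,
        month      INTEGER,
        year       INTEGER,
        weekday    INTEGER
    );
"""
```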

The user_dim table contains:

| Field | Data Type | Key | Dist/Sort Key |
| --- | --- | --- | --- |
| user_id | INTEGER | Primary | |
| first_name | VARCHAR | | |
| last_name | VARCHAR | | |
| gender | VARCHAR | | |
| level | VARCHAR | | |

The song_dim table contains:

| Field | Data Type | Key | Dist/Sort Key |
| --- | --- | --- | --- |
| song_id | VARCHAR | Primary | |
| title | VARCHAR | | |
| artist_id | VARCHAR | Foreign Key | |
| year | INT | | |
| duration | NUMERIC | | |

The artist_dim table contains:

| Field | Data Type | Key | Dist/Sort Key |
| --- | --- | --- | --- |
| artist_id | VARCHAR | Primary | DISTKEY |
| name | VARCHAR | | |
| location | VARCHAR | | |
| latitude | NUMERIC | | |
| longitude | NUMERIC | | |

The songplay_fact table contains:

| Field | Data Type | Key | Dist/Sort Key |
| --- | --- | --- | --- |
| songplay_id | INT | Primary | SORTKEY |
| start_time | TIMESTAMP | Foreign Key | |
| user_id | INTEGER | Foreign Key | |
| song_id | VARCHAR | Foreign Key | |
| artist_id | VARCHAR | Foreign Key | |
| session_id | INT | | |
| location | VARCHAR | | |
| user_agent | VARCHAR | | |
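To illustrate the staging-to-star-schema step, here is a hedged sketch of the songplay_fact insert. Joining the two staging tables on song title and artist name is an assumption about the matching logic, and songplay_id is assumed to be an IDENTITY column that Redshift fills automatically:

```python
# Sketch of the transform from staging tables into the fact table.
# The join condition and the NextSong filter are assumptions about
# the matching logic; songplay_id is assumed to auto-increment.
songplay_insert = """
    INSERT INTO songplay_fact (start_time, user_id, song_id, artist_id,
                               session_id, location, user_agent)
    SELECT e.ts, e.user_id, s.song_id, s.artist_id,
           e.session_id, e.location, e.user_agent
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title
     AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""
```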

Project Process

While this project is similar to the PostgreSQL-Data-Modeling Project, the process for building and moving the data is different.

  1. First, the aws_setup.ipynb notebook is used to enter the AWS credentials and start a Redshift cluster. The AWS credentials and some cluster information are stored in the dwh.cfg file. It is important to ensure dwh.cfg is added to the .gitignore file to avoid posting account credentials to a public repository.
  2. Next, create_tables.py is run. This file creates the staging tables along with the final database tables and their associated relationships.
  3. Once the tables have been built, the etl.py file is run (a sketch of this flow follows this list). This file will:
    • Copy the JSON files from the S3 bucket to the staging tables
    • Insert the data from the staging tables into the correct tables in the final database
  4. Ensure the data was correctly copied into the tables using the aws_data_test.ipynb notebook.
  5. If everything is correct, the data and the Redshift cluster can be deleted using the last portion of aws_setup.ipynb.
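For completeness, a minimal sketch of the etl.py flow under the same assumptions (placeholder connection values, and empty query lists standing in for the COPY and INSERT statements shown earlier):

```python
import psycopg2

# Stand-ins for the project's query lists: the COPY statements for the
# two staging tables and the INSERT statements for the star schema.
copy_table_queries = []
insert_table_queries = []

def main():
    # Connection values are placeholders read from dwh.cfg in the project.
    conn = psycopg2.connect(
        host="<cluster-endpoint>", dbname="dwh",
        user="dwhuser", password="Passw0rd", port=5439,
    )
    cur = conn.cursor()
    for query in copy_table_queries + insert_table_queries:
        cur.execute(query)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    main()
```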

Lessons Learned/Final Thoughts

I learned a couple of important lessons during this project. The first is that your AWS credentials can be saved in your .ipynb_checkpoints files. I learned this after a push to my GitHub repository, where within about 15 seconds I had several emails from GitHub and Amazon, along with Amazon calling my phone number. There were several frantic minutes of removing tokens and deleting repositories. All in all, it was a valuable lesson that thankfully didn't cost me anything but time.

The second thing I really enjoyed was learning how to interact with AWS through pure code. This is where the infrastructure-as-code concept really hit home for me. Not only is it more convenient to start a cluster through some basic Python code, it also makes scaling infrastructure very easy. Through a simple code change, a cluster can be started or stopped, and this can even be driven by conditions in an if statement. This is a very powerful concept, and it made me prioritize learning Docker, which I will discuss in a later blog post.

This is one of the projects where I started to incorporate more complex systems: S3 buckets, using Python to start and stop Redshift clusters, and staging tables. These are all concepts I have since applied in other projects. Overall this was a challenging project, as I had to learn new concepts. The easy part was moving the data from the staging tables to the final database; the hard part was all the "easy" stuff before it. But that is the point of data engineering: to quickly and efficiently move data from one place to another.