Project Background

For this project, the fictional company Sparkify has outgrown the databases created in the PostgreSQL-Data-Modeling project and now needs a more robust database to store its ever-increasing data. The goal of this project is to create a database inside AWS Redshift and populate it from log files stored in an S3 bucket. This project also introduces the concept of infrastructure as code: part of the project entails creating a Python script to launch a new AWS Redshift cluster.
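To give a feel for what infrastructure as code looks like here, below is a minimal sketch of launching a Redshift cluster with boto3. Every identifier, credential, and parameter is a placeholder; the actual setup script reads these values from a config file rather than hard-coding them.

```python
import boto3

# Minimal infrastructure-as-code sketch; all values below are placeholders.
redshift = boto3.client(
    "redshift",
    region_name="us-west-2",
    aws_access_key_id="YOUR_KEY",
    aws_secret_access_key="YOUR_SECRET",
)

redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dwh",
    ClusterIdentifier="dwhCluster",
    MasterUsername="dwhuser",
    MasterUserPassword="Passw0rd",
    IamRoles=["arn:aws:iam::123456789012:role/dwhRole"],  # role with S3 read access
)
```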

The GitHub repository for this project is located here: Redshift-Data-Warehouse-Project

Project Datasets

As with the PostgreSQL-Data-Modeling project, there are two datasets for this project: the song data and the log data. Both datasets are stored in S3 buckets provided by Udacity.

Song Dataset

Each song in the song dataset is stored as a separate JSON file. These JSON files are partitioned in S3 by the first three letters of each song's track ID. Below is an example provided by Udacity:

```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```

An example of the JSON format for each song is:

```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
```

The data schema for this JSON is:

| Field | Data Type |
| --- | --- |
| num_songs | INT |
| artist_id | VARCHAR |
| artist_latitude | REAL |
| artist_longitude | REAL |
| artist_location | VARCHAR |
| artist_name | VARCHAR |
| song_id | VARCHAR |
| title | VARCHAR |
| duration | NUMERIC |
| year | INT |

Log Dataset

The log files for each day are stored in a JSON file. These JSON files are partitioned in S3 by year and month. An example of this format is:

```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```

Each JSON file is actually JSON Lines formatted: each line of the file is a valid JSON object, but the file as a whole is not valid JSON. To avoid errors while reading these files into Python, it is recommended to use pandas with the format shown below.
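A minimal sketch of that read, assuming a local copy of one of the log files shown above; the key detail is lines=True:

```python
import pandas as pd

# lines=True parses each line as a separate JSON object,
# which is required for JSON Lines files.
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)
```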

A sample of this log data in a DataFrame is shown below: Log File Example

The data schema for the log dataset is:

| Field | Data Type |
| --- | --- |
| artist | VARCHAR |
| auth | VARCHAR |
| firstName | VARCHAR |
| gender | VARCHAR |
| itemInSession | INT |
| lastName | VARCHAR |
| length | REAL |
| level | VARCHAR |
| location | VARCHAR |
| method | VARCHAR |
| page | VARCHAR |
| registration | NUMERIC |
| sessionId | INT |
| song | VARCHAR |
| status | INT |
| ts | TIMESTAMP |
| userAgent | VARCHAR |
| userId | INT |

Final Database Schema

Much like the PostgreSQL-Data-Modeling Project, the primary data model for this project is a star schema. However, instead of reading data directly from the S3 buckets into the final database, this project makes use of staging tables that act as an intermediary between the S3 bucket and the final database.

The Staging Tables

There are two staging tables: staging_events and staging_songs. These tables temporarily hold data from the S3 bucket before it is transformed and inserted into the primary use tables.

The staging_songs table contains:

| Field | Data Type |
| --- | --- |
| artist_id | VARCHAR |
| artist_latitude | NUMERIC |
| artist_location | VARCHAR |
| artist_longitude | NUMERIC |
| artist_name | VARCHAR |
| duration | FLOAT |
| num_songs | INTEGER |
| song_id | VARCHAR |
| title | VARCHAR |
| year | INTEGER |

The staging_events table contains:

| Field | Data Type |
| --- | --- |
| artist | VARCHAR |
| auth | VARCHAR |
| first_name | VARCHAR |
| gender | VARCHAR |
| item_in_session | INTEGER |
| last_name | VARCHAR |
| length | NUMERIC |
| level | VARCHAR |
| location | VARCHAR |
| method | VARCHAR |
| page | VARCHAR |
| registration | BIGINT |
| session_id | INTEGER |
| song | VARCHAR |
| status | INTEGER |
| ts | TIMESTAMP |
| user_agent | VARCHAR |
| user_id | INTEGER |
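To make the loading step concrete, here is a hedged sketch of populating staging_events with psycopg2 and a Redshift COPY command. The bucket path, JSONPaths file, IAM role ARN, and connection values are all placeholders; in the project they come from dwh.cfg.

```python
import psycopg2

# Connection values are placeholders; the project reads them from dwh.cfg.
conn = psycopg2.connect(
    host="<cluster-endpoint>", dbname="dwh",
    user="dwhuser", password="Passw0rd", port=5439,
)
cur = conn.cursor()

# COPY pulls the JSON log files from S3 straight into the staging table.
# The bucket, JSONPaths file, and IAM role ARN below are illustrative.
cur.execute("""
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/dwhRole'
    JSON 's3://udacity-dend/log_json_path.json'
    REGION 'us-west-2';
""")
conn.commit()
```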

The use tables are the songplay_fact, time_dim, user_dim, song_dim, and artist_dim tables. Together these tables form the star schema of the final database.

The time_dim table contains:

| Field | Data Type | Key | Dist/Sort Key |
| --- | --- | --- | --- |
| start_time | TIMESTAMP | Primary | SORTKEY/DISTKEY |
| hour | INTEGER | | |
| day | INTEGER | | |
| week | INTEGER | | |
| month | INTEGER | | |
| year | INTEGER | | |
| weekday | INTEGER | | |
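As a sketch of how the Key and Dist/Sort Key columns translate into Redshift DDL, written as a Python query string in the style create_tables.py might use:

```python
# Sketch of the Redshift DDL implied by the table above; column-level
# SORTKEY and DISTKEY attributes are valid Redshift syntax.
create_time_dim = """
    CREATE TABLE IF NOT EXISTS time_dim (
        start_time TIMESTAMP PRIMARY KEY SORTKEY DISTKEY,
        hour       INTEGER,
        day        INTEGER,
        week       INTEGER,
        month      INTEGER,
        year       INTEGER,
        weekday    INTEGER
    );
"""
```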

The user_dim table contains:

| Field | Data Type | Key | Dist/Sort Key |
| --- | --- | --- | --- |
| user_id | INTEGER | Primary | |
| first_name | VARCHAR | | |
| last_name | VARCHAR | | |
| gender | VARCHAR | | |
| level | VARCHAR | | |

The song_dim table contains:

| Field | Data Type | Key | Dist/Sort Key |
| --- | --- | --- | --- |
| song_id | VARCHAR | Primary | |
| title | VARCHAR | | |
| artist_id | VARCHAR | Foreign Key | |
| year | INT | | |
| duration | NUMERIC | | |

The artist_dim table contains:

| Field | Data Type | Key | Dist/Sort Key |
| --- | --- | --- | --- |
| artist_id | VARCHAR | Primary | DISTKEY |
| name | VARCHAR | | |
| location | VARCHAR | | |
| latitude | NUMERIC | | |
| longitude | NUMERIC | | |

The songplay_fact table contains:

| Field | Data Type | Key | Dist/Sort Key |
| --- | --- | --- | --- |
| songplay_id | INT | Primary | SORTKEY |
| start_time | TIMESTAMP | Foreign Key | |
| user_id | INTEGER | Foreign Key | |
| song_id | VARCHAR | Foreign Key | |
| artist_id | VARCHAR | Foreign Key | |
| session_id | INT | | |
| location | VARCHAR | | |
| user_agent | VARCHAR | | |
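To illustrate the staging-to-star-schema step, here is a hedged sketch of the songplay_fact insert. Joining the two staging tables on song title and artist name is an assumption about the matching logic, and songplay_id is assumed to be an IDENTITY column that Redshift fills automatically:

```python
# Sketch of the transform from staging tables into the fact table.
# The join condition and the NextSong filter are assumptions about
# the matching logic; songplay_id is assumed to auto-increment.
songplay_insert = """
    INSERT INTO songplay_fact (start_time, user_id, song_id, artist_id,
                               session_id, location, user_agent)
    SELECT e.ts, e.user_id, s.song_id, s.artist_id,
           e.session_id, e.location, e.user_agent
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title
     AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""
```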

Project Process

While this project is similar to the PostgreSQL-Data-Modeling Project, the process for building and moving the data is different.

  1. First, the aws_setup.ipynb notebook is used to enter the AWS credentials and start a Redshift cluster. The AWS credentials and some cluster information are stored in the dwh.cfg file. It is important to ensure dwh.cfg is added to the .gitignore file to avoid posting account credentials to a public repository.
  2. Next, create_tables.py is run. This file creates the staging tables along with the final database tables and their associated relationships.
  3. Once the tables have been built, the etl.py file is run (a sketch of this flow follows this list). This file will:
    • Copy the JSON files from the S3 bucket to the staging tables
    • Insert the data from the staging tables into the correct tables in the final database
  4. Ensure the data was correctly copied into the tables using the aws_data_test.ipynb notebook.
  5. If everything is correct, the data and the Redshift cluster can be deleted using the last portion of aws_setup.ipynb.
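For completeness, a minimal sketch of the etl.py flow under the same assumptions (placeholder connection values, and empty query lists standing in for the COPY and INSERT statements shown earlier):

```python
import psycopg2

# Stand-ins for the project's query lists: the COPY statements for the
# two staging tables and the INSERT statements for the star schema.
copy_table_queries = []
insert_table_queries = []

def main():
    # Connection values are placeholders read from dwh.cfg in the project.
    conn = psycopg2.connect(
        host="<cluster-endpoint>", dbname="dwh",
        user="dwhuser", password="Passw0rd", port=5439,
    )
    cur = conn.cursor()
    for query in copy_table_queries + insert_table_queries:
        cur.execute(query)
        conn.commit()
    conn.close()

if __name__ == "__main__":
    main()
```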

Lessons Learned/Final Thoughts

I learned a couple of important lessons during this project. The first is that your AWS credentials can be saved in your .ipynb_checkpoints files. I learned this after a push to my GitHub repository, where within about 15 seconds I had several emails from GitHub and Amazon, along with Amazon calling my phone number. There were several frantic minutes of removing tokens and deleting repositories. All in all, it was a valuable lesson that thankfully didn't cost me anything but time.

The second thing I really enjoyed was learning how to interact with AWS through pure code. This is where the infrastructure-as-code concept really hit home for me. Not only is it more convenient to start a cluster through some basic Python code, it also makes scaling infrastructure very easy. Through a simple code change, a cluster can be started or stopped, and this can even be driven by conditions in an if statement. This is a very powerful concept, and it made me prioritize learning Docker, which I will discuss in a later blog post.

This is one of the projects where I started to incorporate more complex systems: S3 buckets, using Python to start and stop Redshift clusters, and staging tables. These are all concepts I have since applied in other projects. Overall this was a challenging project, as I had to learn new concepts. The easy part was moving the data from the staging tables to the final database; the hard part was all the "easy" stuff before it. But that is the point of data engineering: to quickly and efficiently move data from one place to another.