Project Background

The purpose of this project was to take NFL game data from the now-deprecated NFL API, extract the details, conduct the required data transformations, and store the data in AWS Redshift. I intended to use this project to further analyze NFL game data; however, the NFL has since restricted access to this dataset. As such, it is no longer possible to replicate this project.

The end goal of this project is to have a database where it is possible to conduct complex aggregate-based queries against NFL data. To do this I used game data available from the NFL API, but as stated earlier, this API is no longer available to unregistered users.
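As a taste of the aggregate-based queries the final schema supports, a query along these lines could be run against the play_fact and players_dim tables described later in this write-up. This is only a sketch: the table and column names come from the schemas below, but the 'RUSH' play-type literal is an assumption.

```python
# Hypothetical example: top 10 rushing-yard totals per player per season.
# Table and column names follow the play_fact / players_dim schemas in
# this document; the 'RUSH' play_type value is an assumption.
TOP_RUSHERS_SQL = """
SELECT p.display_name,
       f.season,
       SUM(f.yards) AS total_rush_yards
FROM play_fact AS f
JOIN players_dim AS p
  ON p.nflid = f.player_id
WHERE f.play_type = 'RUSH'
GROUP BY p.display_name, f.season
ORDER BY total_rush_yards DESC
LIMIT 10;
"""
```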

This project used several technologies:

  • Apache Airflow for workflow management
  • AWS Lambda to fetch, extract, and transform data
  • AWS S3 for JSON storage
  • AWS Redshift as the analytical database
  • Python as the primary programming language

The GitHub for this project is located here: NFL-Data-Pipeline

Project Datasets

The datasets used in this project (Teams, Schedule, Coaches, Players, and Games) come from the NFL endpoints.

Teams: The team data by year used to be found at https://www.nfl.com/feeds-rs/teams/.json. For example, the data for all NFL teams for the year 2000 used to be found at https://www.nfl.com/feeds-rs/teams/2000.json. The following data is provided by this endpoint:

  • Season: Season of the record
  • Team ID: A unique ID for each team in the NFL.
  • Team Abbreviation: Commonly used team abbreviation; for example, the Denver Broncos are "DEN".
  • City/State: How the team is identified by location; for example, the Denver Broncos are "Denver".
  • Full Name: The full name of the team.
  • Nickname: Common nickname for the team.
  • Team Type: Used to annotate whether the team is a regular season team or the "Pro Bowl Team".
  • Conference Abbreviation, ID, and Full Name
  • Division Abbreviation, ID, and Full Name
  • Year Founded
  • Stadium name
  • Ticket Phone Number
  • Team URL
  • Team Ticket URL

Schedule: The full schedule for the entire year was posted at https://www.nfl.com/feeds-rs/schedules/.json. For example, the 2019 season used to be found at https://www.nfl.com/feeds-rs/schedules/2019.json. The only changes made during the year are to the post season game locations. The following data is provided by this endpoint:

  • Season:
  • Season Type: This annotates whether the game is a preseason, regular season, or post season game.
  • Week:
  • Game ID: A unique ID used for each game.
  • Game Key: A unique ID used for each game
  • Game Time Eastern:
  • Game Time Local:
  • ISO TIME:
  • Home Team ID:
  • Visitor Team ID:
  • Home Team Abbr:
  • Visitor Team Abbr:
  • Home Display Name:
  • Visitor Display Name:
  • Home Nick Name:
  • Visitor Nick Name:
  • Game Type:
  • Week Name Abbreviation:
  • Week Name:
  • Visitor Team, Season, Team ID, City/State, Full Name, Nickname, Team Type, Conference Abbreviation, Division Abbreviation:
  • Home Team, Season, Team ID, City/State, Full Name, Nickname, Team Type, Conference Abbreviation, Division Abbreviation:
  • Site ID, Site City, Site Full Name, Site State, Roof type, and network channel:

Coaches: Information for each coach for each team by year used to be found at https://www.nfl.com/feeds-rs/coach/byTeam//.json. For example, the Denver Broncos coach information for 2019 could be found here: https://www.nfl.com/feeds-rs/coach/byTeam/1400/2019.json. This contains basic biographical information for each coach. The following data is provided:

  • NFLID: This is a unique ID used for coaches and players.
  • Status: This indicates if the coach or player is active, retired, or free agent.
  • Display Name:
  • First Name:
  • Last Name:
  • ESBID: This is a unique ID used for coaches and players. This number is also used to generate the url for the picture used.
  • Birthdate:
  • Home Town:
  • Is Deceased:
  • Current Status: This indicates if the coach or player is active, retired, or free agent.
  • College ID:
  • College Name:

Players: Player information was found at https://www.nfl.com/feeds-rs/playerStats/.json. For example, the player information for Von Miller could be found at https://www.nfl.com/feeds-rs/playerStats/2495202.json. While this endpoint provides some statistical information for the player, this project only uses the following:

  • NFLID: This is a unique ID used for coaches and players.
  • Status: This indicates if the coach or player is active, retired, or free agent.
  • Display Name:
  • First Name:
  • Last Name:
  • ESBID: This is a unique ID used for coaches and players.
  • GSIS ID: This is a unique ID used for coaches and players.
  • Middle Name:
  • Suffix:
  • Birth Date:
  • Home Town:
  • College ID:
  • College Name:
  • Position Group:
  • Position:
  • Jersey Number:
  • Height:
  • Weight:
  • Team ID:

Games: Detailed game information was found at https://www.nfl.com/feeds-rs/boxscorePbp/.json. An example of this data source can be seen at https://www.nfl.com/feeds-rs/boxscorePbp/2016020700.json. This data source was used primarily to fill the fact table. The JSON document is deeply nested, so as part of the data preprocessing it will be flattened before being placed in the staging table. The fields from this data source that will be used are:

  • Season:
  • Season Type: This indicates if the game is Preseason, Regular Season, or Post Season.
  • Week:
  • Game Id: This is a unique identifier for each game.
  • Game Key: This is a unique identifier for each game.
  • Game Date:
  • ISO Time:
  • Home Team ID: This is the unique ID for the home team.
  • Visitor Team ID: This is the unique id for the visitor team.
  • Game Type: This is the common name for the game; it can be Preseason, Regular Season, or even SB for Super Bowl.
  • Phase: This indicates the current phase of the game. Unless the game is currently in session this should read "FINAL"
  • Visitor Team Score Total:
  • Visitor Team Score Q1:
  • Visitor Team Score Q2:
  • Visitor Team Score Q3:
  • Visitor Team Score Q4:
  • Visitor Team Score OT:
  • Home Team Score Total:
  • Home Team Score Q1:
  • Home Team Score Q2:
  • Home Team Score Q3:
  • Home Team Score Q4:
  • Home Team Score OT:
  • Drive Sequence: This is a unique identifier for the drive for this specific game.
  • Play ID: This is a unique identifier for this play for this drive sequence.
  • Scoring: This is a Boolean value indicating whether this play resulted in a score.
  • Scoring Team ID: This will indicate the ID of the scoring team if the play resulted in a score.
  • Possession Team ID: This indicates the ID of the team that is currently on offense.
  • Play Type: This is a categorical variable which indicates the play type. Most common forms are Pass, Rush, and Kick-off.
  • Quarter:
  • Down:
  • Yards to Go:
  • First Down or Touch Down:
  • Play Stat Sequence: As a play can have multiple participants this provides a unique value for each portion of a play.
  • Stat ID: This is the Stat ID of the play. This can be cross referenced in the stat_codes.csv, which will be included as a dimension table.
  • NFLID: This is the unique ID of the player involved with this specific portion of the play.
  • Yards: This is the number of yards this player is responsible for. If the player had no yards, the value will be zero.
  • Play Description: This is a text description of the play.
  • Play Video: If the play contains a highlight video, a direct link to the video is provided here.
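Because the boxscorePbp document is deeply nested, the preprocessing step flattens it before staging. A generic helper in the spirit of that step could look like the following; the sample fragment is hypothetical, not actual feed output:

```python
def flatten(obj, parent_key="", sep="_"):
    """Recursively flatten nested dicts/lists into one flat dict whose
    keys join the nesting path with `sep`; list items are indexed."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            items.update(flatten(value, new_key, sep))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{parent_key}{sep}{index}", sep))
    else:
        items[parent_key] = obj
    return items

# Hypothetical fragment in the spirit of a boxscorePbp document:
game = {"gameId": "2016020700",
        "drives": [{"driveSeq": 1,
                    "plays": [{"playId": 55, "scoring": True}]}]}
flat = flatten(game)
# flat now has flat keys like "drives_0_plays_0_playId"
```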

STAT ID: This is a CSV containing the STAT ID codes and descriptions for each stat used in a game. The CSV can be seen here STAT ID. The fields are described below.

  • STAT ID: This is the unique identifier for each stat.
  • Name: This is the common name for the stat.
  • Comment: This is a detailed description of the stat.

Project Tools

This project incorporated tools from both the Udacity course as well as other AWS offerings. The tools used included:

  • Python
  • AWS S3
  • AWS Lambda
  • AWS Redshift
  • Apache Airflow

Python: Python was used to access the NFL endpoints and conduct the data transformations to load the data into Redshift staging tables as well as the fact and dimension tables. Python code was used on a local computer to initiate the data ingest and transformation, in AWS Lambda to access multiple endpoints simultaneously, and in Apache Airflow.

AWS S3: S3 cloud storage was used as a temporary storage location for the team data, NFL schedule, coach data, player data, and game data JSON files.

AWS Lambda: Lambda was used for the majority of the data transformation. Lambda is serverless in nature and can be configured to run Python code. Lambda works very well on small pieces of code which can be fully executed in less than fifteen minutes. As the transformations for the NFL data were simple but there were many files to transform, Lambda was used to parallelize these data transformations. When a JSON file was loaded into an S3 bucket, the appropriate Lambda function would execute.
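A minimal sketch of the S3-triggered Lambda pattern described above. The `raw/` and `processed/` prefixes and the transform itself are illustrative, not the project's actual code:

```python
import json
import urllib.parse

def transform(doc):
    """Placeholder transform: drop None-valued fields from a record."""
    return {key: value for key, value in doc.items() if value is not None}

def lambda_handler(event, context):
    """Entry point for an S3 put-triggered Lambda: read the uploaded JSON,
    transform it, and write the result under a different prefix."""
    import boto3  # provided by the Lambda runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        doc = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        s3.put_object(Bucket=bucket,
                      Key=key.replace("raw/", "processed/", 1),
                      Body=json.dumps(transform(doc)).encode("utf-8"))
```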

AWS Redshift: Redshift was used as the analytical database. Data was loaded into staging tables from S3, and then SQL queries were used to move the data into the fact and dimension tables.
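The staging-to-warehouse pattern is a Redshift COPY from S3 followed by an INSERT ... SELECT. A sketch for the teams data, using the bucket path from the mapping and the team_staging/team_dim columns listed below; the IAM role ARN is a placeholder, and JSON 'auto' assumes the staged files are flat JSON:

```python
# Redshift load pattern sketch for the teams data (not the project's
# actual DDL). The IAM role ARN below is a placeholder.
COPY_TEAMS_SQL = """
COPY team_staging
FROM 's3://nfl-cap/teams'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load-role'
FORMAT AS JSON 'auto';
"""

INSERT_TEAMS_SQL = """
INSERT INTO team_dim (team_id, season, abbr, citystate, full_name, nick,
                      team_type, conference_abbr, division_abbr,
                      year_found, stadium_name)
SELECT team_id, season, abbr, citystate, full_name, nick,
       team_type, conference_abbr, division_abbr,
       year_found, stadium_name
FROM team_staging;
"""
```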

Apache Airflow: Airflow was used to monitor and manage some of the data transformation workflows for this project.

Final Database Schema

The data model used is a simple star schema. This model works well as it is easy to query and returns results quickly. The tables have not been normalized to 3NF; while this increases redundancy, it results in faster query times. Once the data has been preprocessed and placed in the correct S3 folder, it can be copied into the staging tables. From the staging tables it is inserted into the correct fact and dimension tables.

The mapping for S3 folder to staging table to fact or dim table is as follows:

s3://nfl-cap/coaches » coaches_staging » coaches_dim

s3://nfl-cap/gameinfo » gameinfo_staging » game_dim

s3://nfl-cap/schedules_details » schedules_details_staging » game_dim

s3://nfl-cap/gameplays » gameplays_staging » play_fact

s3://nfl-cap/player_info » player_info_staging » players_dim

s3://nfl-cap/statids/ » statid_codes_staging » statid_dim

s3://nfl-cap/teams » team_staging » team_dim

Staging Tables

There are seven staging tables for this project. They can be seen in this PDF.

Coaches Staging Table

Field Data Type
coach_id varchar
season int
week varchar
display_name varchar
first_name varchar
last_name varchar
esbid varchar
status varchar
birthdate date
hometown varchar
college varchar
team_id varchar
isdeceased varchar
pic_url varchar

Game Info Staging Table

Field Data Type
game_id varchar
season int
season_type varchar
week varchar
game_key varchar
game_date date
game_time_iso timestamp
vis_points_total int
vis_points_q1 int
vis_points_q2 int
vis_points_q3 int
vis_points_q4 int
vis_points_ot int
home_points_total int
home_points_q1 int
home_points_q2 int
home_points_q3 int
home_points_q4 int
home_points_ot int
win_team varchar
lose_team varchar
site_id varchar
site_city varchar
site_full_name varchar
site_state varchar
roof_type varchar
phase varchar

Game Plays Staging Table

Field Data Type
game_id varchar
week varchar
drive_seq varchar
play_id varchar
play_stat_id varchar
season int
home_team varchar
def_team varchar
off_team varchar
vis_team varchar
penalty varchar
scoring varchar
scoring_team varchar
play_type varchar
quarter varchar
down int
yard_to_go numeric
first_down varchar
play_descript varchar  
play_vid varchar
stat_id varchar
yards int
player_id varchar
player_team varchar

Player Info Staging Table

Field Data Type
nfl_id varchar
esb_id varchar
gsis_id varchar
status varchar
display_name varchar
first_name varchar
last_name varchar
middle_name varchar
suffix varchar
birth_date date
home_town varchar
college_id varchar
college_name varchar
position_group varchar
position varchar
jersey_number varchar
height int
weight int
current_team varchar
player_pic_url varchar

Statid Code Staging Table

Field Data Type
stat_id varchar
name varchar
comment varchar

Schedule Details Staging Table

Field Data Type
game_id varchar
season int
season_type varchar
week varchar
game_key varchar
home_id varchar
vis_id varchar
game_type varchar
week_name_abbr varchar
week_name varchar

Team Staging Table

Field Data Type
team_id varchar
season int
abbr varchar
citystate varchar
full_name varchar
nick varchar
team_type varchar
conference_abbr varchar
division_abbr varchar
year_found int
stadium_name varchar

Fact and Dim Tables

There are five dimension tables and one fact table. They can be seen in this PDF.

Coaches Dim Table

Field Data Type Key
coach_id varchar PRIMARY
season int PRIMARY
week varchar PRIMARY
display_name varchar  
full_name varchar  
last_name varchar  
esbid varchar  
birthdate date  
hometown varchar  
college int  
team_id varchar  
isdeceased varchar  
pic_url varchar  

Players Dim Table

Field Data Type Key
nflid varchar PRIMARY
esbid varchar  
gsisid varchar  
status varchar  
display_name varchar  
first_name varchar  
last_name varchar  
middle_name varchar  
suffix varchar  
birthdate date  
hometown varchar  
college_id varchar  
college varchar  
position_group varchar  
position varchar  
jersey_number varchar  
height int  
weight int  
current_team varchar  
player_pic_url varchar  

Game Dim Table

Field Data Type Key
game_id varchar PRIMARY
season int  
season_type varchar  
week varchar  
game_key varchar  
game_date date  
game_time_iso timestamp  
vis_points_total int  
vis_points_q1 int  
vis_points_q2 int  
vis_points_q3 int  
vis_points_q4 int  
vis_points_ot int  
home_points_total int  
home_points_q1 int  
home_points_q2 int  
home_points_q3 int  
home_points_q4 int  
home_points_ot int  
win_team varchar  
lose_team varchar  
site_id varchar  
site_city varchar  
site_full_name varchar  
site_state varchar  
roof_type varchar  
game_phase varchar  
week_name_abbr varchar  
week_name varchar  
game_type varchar  
home_id varchar  
away_id varchar  

Team Dim Table

Field Data Type Key
team_id varchar PRIMARY
season int  
abbr varchar  
citystate varchar  
full_name varchar  
nick varchar  
team_type varchar  
conference_abbr varchar  
division_abbr varchar  
year_found int  
stadium_name varchar  

Stat ID Dim Table

Field Data Type Key
stat_id varchar PRIMARY
name varchar  
comment varchar  

Play Fact Table

Field Data Type KEY
guid sequence PRIMARY
game_id varchar  
week varchar  
drive_seq varchar  
play_id varchar  
play_stat_id varchar  
season int  
home_team varchar  
def_team varchar  
off_team varchar  
vis_team varchar  
penalty varchar  
scoring varchar  
scoring_team varchar  
play_type varchar  
quarter varchar  
down int  
yard_to_go numeric  
first_down varchar  
play_descript varchar  
play_vid varchar  
stat_id varchar  
yards int  
player_id varchar  
player_team varchar  

Project Process

There are two primary problems with this dataset. The first is gathering the data; the second is that the primary dataset, games, is a heavily nested JSON document and requires preprocessing before it can be loaded into a staging table. To gather all the required JSON files, the following process was used.

Data Preprocess

Step 1

The data gathering process is initiated with a Python script where the user enters the season of the NFL data they want to load into the database. The script code can be viewed in the access_raw_data Jupyter notebook. This script will execute four functions:

  • The script will go to https://www.nfl.com/feeds-rs/schedules/.json and do the following:
    • Save each Game ID as a separate JSON in s3
    • Save Game detailed information in s3
  • For each team in the NFL schedule for that season the script will go to https://www.nfl.com/feeds-rs/teams/.json and get detailed information for each team for the season selected. Each team will be saved as a separate json in s3.

  • For each team in the NFL schedule the script will go to https://www.nfl.com/feeds-rs/coach/byTeam//.json and save a json file for each coach for each season. The json file is saved by team, season, week. If the script is run on a weekly basis it will be able to log coaching changes that occur during the regular season.
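The steps above can be condensed into a sketch like the following. The helper names and the schedule document's `gameSchedules`/`gameId` keys are assumptions about the feed's shape; the real code is in the access_raw_data notebook:

```python
import json
import urllib.request

def fetch_json(url):
    """HTTP GET a feeds-rs document and parse it (the endpoints are now
    closed to unregistered users, so this will fail today)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def save_json(bucket, key, doc):
    """Land one JSON document in S3; needs boto3 and AWS credentials."""
    import boto3  # lazy import so the pure helpers stay stdlib-only
    boto3.client("s3").put_object(Bucket=bucket, Key=key,
                                  Body=json.dumps(doc).encode("utf-8"))

def extract_game_ids(schedule):
    """Pull game IDs from a schedule document; the 'gameSchedules' and
    'gameId' key names are assumptions about the feed's shape."""
    return [game["gameId"] for game in schedule.get("gameSchedules", [])]

def gather_season(season, bucket="nfl-cap"):
    """Save one small JSON per game ID so the per-game Lambda trigger
    (step 2) can fan out in parallel."""
    schedule = fetch_json(f"https://www.nfl.com/feeds-rs/schedules/{season}.json")
    for game_id in extract_game_ids(schedule):
        save_json(bucket, f"gameids/{game_id}.json", {"gameId": game_id})
```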

Step 2

Processing a single season, deconstructing each game file, and saving each play stat and player JSON file could easily take over an hour. By using AWS Lambda this time can be reduced to minutes. In step 2, each game ID saved in S3 triggers an AWS Lambda function which downloads the complete game JSON and saves it in another S3 bucket. The only downside is that each S3 prefix can only have one Lambda trigger. Also, Lambda functions can run for a maximum of 15 minutes, so it is a best practice to keep them simple. It takes less than 30 seconds for Lambda to open each JSON file, download the game file, and save it in the correct S3 folder. The code for this Lambda function is saved in the lambda_functions folder under the name gameids.py.

Step 3

For each full game saved in S3, Lambda will launch and deconstruct the game, saving each play stat as a separate JSON file in S3. This function also performs a second task of saving each player from the game as a separate JSON document. The code for this function can be viewed in the lambda_functions folder under the name gamesheet_decon.py.

Step 4

For each Player ID saved, Lambda will save the player JSON file into an S3 bucket in a format that is ready to load into the Redshift staging table.
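This per-player step amounts to reducing the full playerStats document to the columns the player_info staging table needs. A sketch, where the camelCase field names are assumptions about the feed's shape rather than documented API:

```python
# Fields the player_info staging table needs, keyed by assumed camelCase
# names from the feed (these key names are assumptions, not documented).
PLAYER_FIELDS = ("nflId", "esbId", "gsisId", "status", "displayName",
                 "firstName", "lastName", "middleName", "suffix",
                 "birthDate", "homeTown", "collegeId", "collegeName",
                 "positionGroup", "position", "jerseyNumber",
                 "height", "weight", "teamId")

def player_record(doc):
    """Reduce a full playerStats document to a flat, staging-ready dict;
    missing fields become None (NULL after the Redshift COPY)."""
    return {field: doc.get(field) for field in PLAYER_FIELDS}
```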

Step 5

Once the initial data transformations have taken place and the de-nested JSON files are in the staging S3 buckets, the process can be started in Apache Airflow. The data will be copied from S3 into the Redshift staging tables.

Step 6

Once the data has been moved from each bucket to the appropriate staging table, the staging tables will begin to populate the fact and dimension tables.

Step 7

After all the staging tables have populated the fact and dimension tables, a data quality check is conducted, and Airflow will stop the process.

DAG Chart

Lessons Learned/Final Thoughts

The goal of this project was to build a database from NFL endpoints. This database allows people to conduct queries on player, team, and coach performance. It can be used by NFL fans as well as fantasy football players to help plan their teams and track potential points. The primary tools for this project were Python, AWS Lambda, AWS S3, AWS Redshift, and Airflow. AWS Lambda was used to preprocess data that could take hours if done sequentially; by using Lambda this preprocessing was reduced to minutes. AWS S3 storage was used for the JSON files, as the storage is cheap and connects easily to AWS Lambda and Redshift. AWS Redshift was used to store the data; as a columnar, distributed database, it returns query results quickly. Airflow provided the ability to schedule the transfer of data from S3 to the staging tables and finally to the dimension and fact tables.

For me this project is bittersweet. After working with the NFL data for about a year, I finally found a way to easily move the data into a database which allowed for aggregate analysis only for the NFL to stop public access to the endpoints. This project allowed me to incorporate all the tools I had used in several other projects and was a great exercise.