The NHL playoffs are often regarded as one of the most random events in sports. By random, I mean that the expected winner does not win the playoffs very often. For example, the team that wins the Presidents' Trophy (the team with the most points in the regular season) has won the playoffs only 8 times in the 35 years the award has been given out. Additionally, the last team to win both the Presidents' Trophy and the Stanley Cup was the Chicago Blackhawks in the 2012-13 season. There are many possible explanations for this randomness. Braden Holtby, an NHL goalie, attributes it to the game being played and officiated differently: penalties are called less often because referees tend to let more things go than in the regular season, not wanting to impact the game so that the true best team wins. While this motivation is respectable, this officiating policy may give an advantage to more physical teams. Additionally, the NHL has an 82-game regular season schedule. Due to the physicality of the sport, teams cannot realistically perform their best every single game of the regular season. All this changes in the playoffs, because teams do not know when their last game will be played. Furthermore, since each round of the playoffs is a best-of-7 series, teams have more of an opportunity to scout their opponents. This is not done as much in the regular season, where a densely packed schedule has games against different teams every couple of days. Finally, there is inherent "puck luck" in hockey, because the puck bounces around a lot. A great recent example came in the first round of the 2022 NHL Playoffs. The Washington Capitals were leading the best-of-7 series 2 games to 1 against the top-seeded Florida Panthers, and were up 2-1 in Game 5 with 2 minutes remaining.
The Panthers pulled their goalie for an extra skater, and Capitals forward Garnet Hathaway barely missed an empty-net goal. Immediately after, the Panthers scored, won the game in overtime, and went on to take the series 4 games to 2. This highlights the "randomness": had Hathaway scored, the Capitals would have led the series 3-1 and would have been very likely to win it.
All these factors contribute to the NHL playoffs being very random. However, when I went to pick a final project, I could not help but notice that there are repeat winners in the NHL playoffs, including the back-to-back Stanley Cup winners Tampa Bay Lightning (2020 and 2021) and Pittsburgh Penguins (2016 and 2017). I believe there could be a reason for this, and that this reason could potentially generalize to all winners.
So, for my final project, I wanted to see whether the NHL playoffs really are random, or whether there is some relation between how well teams do in the regular season and how well they do in the playoffs. To start, I am going to take as the null hypothesis that the NHL playoffs are random, with the alternative hypothesis that there is a way to predict how well a team will do in the playoffs.
Before getting started, here is the list of required Python libraries:
For this project, all the data collected was from https://www.hockey-reference.com. This website contains all the regular season and playoff data for each team, as well as each team's individual game history. The regular season table for each year consists of every team's regular season wins, losses, points, goals scored, goalie save percentage, etc. Recall the goal is to see whether any of these factors, or some combination of them, has an impact on a team's playoff performance. The playoff data for each year consists of the same statistics found in the regular season, but for the playoffs. However, I only care about the number of wins a team has in the playoffs, since I am trying to see how regular season performance relates to the number of playoff wins. As a quick recap, the NHL playoffs currently work as a 16-team bracket where each round is a best-of-7 series. Additionally, one can see what round a team reached with a simple metric: each round takes 4 wins, so dividing a team's playoff wins by 4 (rounding down) gives the number of rounds it won.
So, if a team wins 16 games in the playoffs, then they win the Stanley Cup.
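That wins-to-rounds metric can be sketched as a small helper function (a sketch; the function name is my own):

```python
# Each best-of-7 round takes exactly 4 wins, so integer-dividing a team's
# total playoff wins by 4 gives the number of rounds it won.
def rounds_won(playoff_wins):
    """Rounds a team won, given its total playoff win count (0-16)."""
    return playoff_wins // 4

print(rounds_won(16))  # 4 rounds won -> Stanley Cup champion
print(rounds_won(5))   # won the first round, eliminated in the second
print(rounds_won(3))   # eliminated in the first round
```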
Here is an example of the 2018 regular season statistics table, the 2018 playoff table, and the Washington Capitals' team data:
Disclaimer:
Before scraping the data, it should be noted that I was unable to get the regular season statistics table using traditional web scraping methods (pandas read_html and Beautiful Soup). This is because the Team Statistics table on the website is improperly formatted. Fortunately, the website has the option of manually downloading the table as an Excel file, which can then be converted to a CSV file and read in with pandas.
These steps can then be repeated for each of the years of interest, with all the files placed in a directory that can then be read in with pandas read_csv. Unfortunately, I cannot host the data on a GitHub page or anywhere else, because that would violate the website's terms of use. All the other data I wanted from the website, however, could easily be read in using BeautifulSoup or pandas read_html. It should be noted that I chose to download the data manually because, after many hours of trying to read in the table with various scraping techniques, it still would not work, and I found it easier to download the data. Additionally, the professor said I was allowed to do this.
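As a sketch of that read-back workflow (the file-name pattern mirrors the `data/nhl_{year}.csv` convention used below, but the dataframes here are throwaway stand-ins written to a temporary directory, since the real files cannot be redistributed):

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as data_dir:
    # stand-ins for the tables manually downloaded and converted to CSV
    for year in (2018, 2019):
        pd.DataFrame({"Team": ["Washington Capitals*"], "PTS": [105]}).to_csv(
            os.path.join(data_dir, f"nhl_{year}.csv"), index=False)

    # the read-back pattern used for the real data/ directory below
    season_tables = {
        year: pd.read_csv(os.path.join(data_dir, f"nhl_{year}.csv"))
        for year in (2018, 2019)
    }

print(sorted(season_tables))  # [2018, 2019]
```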
#necessary libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np
from matplotlib import pyplot as plt
import sklearn.model_selection as ms
from scipy import stats
import statsmodels.api as sm
from sklearn import linear_model
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import random
The code below gets the regular season data for each team from the years 2000 to 2022. It is important to note that I had to download all of the statistics tables into a directory, from which they are then read in with pandas' read_csv function. At the end, this code produces an array of dataframes for the regular season data from 2000-2022. I also saved the data from the current season, but stored it separately since that season is still being played.
#hockey-reference.com keys each team's data link on a team-name abbreviation. The abbreviations are not standardized and are not listed on the website, so I manually built a dictionary mapping each team name to its abbreviation
teams_with_abr = {'St. Louis Blues':'STL', 'Detroit Red Wings':'DET', 'Philadelphia Flyers' : 'PHI',
'New Jersey Devils' : 'NJD','Washington Capitals':'WSH', 'Dallas Stars':'DAL',
'Toronto Maple Leafs':'TOR', 'Florida Panthers' :'FLA', 'Colorado Avalanche' : 'COL',
'Ottawa Senators' : 'OTT' , 'Los Angeles Kings' : 'LAK' , 'Phoenix Coyotes' : 'PHX', 'Arizona Coyotes' : 'ARI',
'Pittsburgh Penguins' : 'PIT', 'Edmonton Oilers' : 'EDM' , 'San Jose Sharks' : 'SJS',
'Buffalo Sabres' : 'BUF', 'Carolina Hurricanes' : 'CAR', 'Vancouver Canucks' : 'VAN',
'Montreal Canadiens' : 'MTL' , 'Mighty Ducks of Anaheim' : 'MDA', 'Chicago Blackhawks' : 'CHI',
'Calgary Flames' : 'CGY' , 'New York Rangers' : 'NYR', 'Boston Bruins' :'BOS', 'Nashville Predators' : "NSH",
'New York Islanders' : 'NYI' , 'Tampa Bay Lightning' : 'TBL' , 'Atlanta Thrashers' : 'ATL' , 'Winnipeg Jets' : 'WPG',
'Vegas Golden Knights' : 'VEG', 'Seattle Kraken' : 'SEA', 'Anaheim Ducks' : 'ANA', 'Minnesota Wild' : 'MIN'
,'Columbus Blue Jackets':'CBJ'}
#all my data is from the years 2000 to 2022; the current 2022 season has not
#finished, so its data is saved separately to later make predictions on who will win the Stanley Cup this year
years = [i for i in range(2000, 2023)]
regular_season = []
#saving the dataframe for the current 2022 regular season separately
current_regular_season = []
#looping through each year
for year in years:
    #the 2004-05 NHL season was cancelled by the lockout, so skip 2005
    if year != 2005:
        #had to download the data manually and read it from a directory
        data = f"data/nhl_{year}.csv"
        #reading in the csv file as a pandas dataframe
        curr_df = pd.read_csv(data, index_col=[0])
        #tidying up the columns: the real header is stored in the first row
        curr_df.columns = curr_df.iloc[0]
        curr_df = curr_df.drop(curr_df.index[0])
        curr_df = curr_df.rename(columns={np.NaN: 'Team'})
        #dropping the last row (league averages)
        curr_df = curr_df[:-1]
        #a * in a team's name marks a playoff team, so only keep those rows.
        #Right before submitting this project, the website removed the * for the
        #2022 playoff teams, so for 2022 take the 16 teams with the most points instead
        if year == 2022:
            curr_df = curr_df.iloc[0:16]
        else:
            curr_df = curr_df.loc[curr_df['Team'].str.contains(r"\*")]
        #after filtering, strip the * from each team's name, since only playoff teams remain
        curr_df['Team'] = curr_df['Team'].str.strip("*")
        #converting desired columns into floats (previously stored as strings)
        float_cols = ['AvAge', 'PTS%', 'PIM/G', 'S%', 'SV%', 'SRS', 'SOS',
                      'GF/G', 'GA/G', 'PP%', 'PK%']
        curr_df[float_cols] = curr_df[float_cols].astype(float)
        #converting desired columns into ints (previously stored as strings)
        int_cols = ['W', 'L', 'OL', 'PTS', 'GF', 'GA']
        curr_df[int_cols] = curr_df[int_cols].astype(int)
        #creating additional columns that will be used later in the project
        curr_df['PlayoffWins'] = np.NaN
        curr_df['WinPctLast10'] = np.NaN
        curr_df['PlayoffRound'] = np.NaN
        #getting a team's id, used because each team's page is keyed by its id
        curr_df['ID'] = curr_df['Team'].map(teams_with_abr)
        #getting each team's win percentage in its last 10 games (it will be seen why later)
        for tm in curr_df['ID']:
            curr_team = pd.read_html(
                f"https://www.hockey-reference.com/teams/{tm}/{year}_games.html")
            curr_team = curr_team[0].tail(10)
            #for some reason the 2022 season's result column is labeled differently
            if year == 2022:
                curr_team = curr_team.rename(columns={'Unnamed: 7': 'Result'})
            else:
                curr_team = curr_team.rename(columns={'Unnamed: 6': 'Result'})
            #counting the team's wins, losses, and ties over its last 10 games
            game_result = curr_team['Result'].value_counts()
            #from 2000-2004 the NHL allowed ties, so check each possible combination of results
            if 'T' in game_result.index and 'L' in game_result.index and 'W' in game_result.index:
                win_percentage = game_result['W'] / (game_result['W'] + game_result['L'] + game_result['T'])
            elif 'W' in game_result.index and 'L' in game_result.index:
                win_percentage = game_result['W'] / (game_result['W'] + game_result['L'])
            elif 'W' in game_result.index and 'T' in game_result.index:
                win_percentage = game_result['W'] / (game_result['W'] + game_result['T'])
            elif 'W' in game_result.index:
                win_percentage = 1.0
            else:
                win_percentage = 0.0
            #saving the computed last-10 win percentage to the team's row for this year
            curr_df.loc[curr_df.loc[curr_df['ID'] == tm].index[0], ['WinPctLast10']] = win_percentage
        #saving the current year
        curr_df['year'] = year
        curr_df = curr_df[['Team', 'AvAge', 'GP', 'W', 'L', 'OL', 'PTS', 'PTS%', 'GF',
                           'GF/G', 'GA/G', 'PP', 'PPO', 'PP%', 'PPA', 'PPOA', 'GA', 'S%', 'SRS', 'SOS',
                           'PK%', 'PIM/G', 'SV%', 'PlayoffWins', 'WinPctLast10', 'ID', 'year']]
        #appending the dataframe to the array, unless it is the unfinished 2022 season
        if year != 2022:
            regular_season.append(curr_df)
        else:
            current_regular_season.append(curr_df)
#example dataframe from the 2004 regular season
regular_season[4].head(100)
Rk | Team | AvAge | GP | W | L | OL | PTS | PTS% | GF | GF/G | ... | S% | SRS | SOS | PK% | PIM/G | SV% | PlayoffWins | WinPctLast10 | ID | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Detroit Red Wings | 31.9 | 82 | 48 | 21 | 2 | 109 | 0.665 | 255 | 3.11 | ... | 10.3 | 0.72 | -0.09 | 86.75 | 11.5 | 0.912 | NaN | 0.6 | DET | 2004 |
2 | Tampa Bay Lightning | 28.1 | 82 | 46 | 22 | 6 | 106 | 0.646 | 245 | 2.99 | ... | 10.0 | 0.56 | -0.09 | 84.89 | 11.6 | 0.908 | NaN | 0.5 | TBL | 2004 |
3 | Boston Bruins | 27.9 | 82 | 41 | 19 | 7 | 104 | 0.634 | 209 | 2.55 | ... | 8.4 | 0.29 | 0.03 | 83.58 | 14.5 | 0.918 | NaN | 0.6 | BOS | 2004 |
4 | San Jose Sharks | 26.8 | 82 | 43 | 21 | 6 | 104 | 0.634 | 219 | 2.67 | ... | 9.5 | 0.40 | -0.04 | 85.27 | 13.1 | 0.923 | NaN | 0.8 | SJS | 2004 |
5 | Toronto Maple Leafs | 30.7 | 82 | 45 | 24 | 3 | 103 | 0.628 | 242 | 2.95 | ... | 10.7 | 0.48 | 0.02 | 83.42 | 17.5 | 0.906 | NaN | 0.7 | TOR | 2004 |
6 | Ottawa Senators | 27.0 | 82 | 43 | 23 | 6 | 102 | 0.622 | 262 | 3.20 | ... | 10.8 | 0.88 | -0.01 | 83.57 | 15.3 | 0.907 | NaN | 0.5 | OTT | 2004 |
7 | Vancouver Canucks | 27.5 | 82 | 43 | 24 | 5 | 101 | 0.616 | 235 | 2.87 | ... | 9.9 | 0.52 | 0.02 | 86.11 | 15.2 | 0.911 | NaN | 0.6 | VAN | 2004 |
8 | Philadelphia Flyers | 29.7 | 82 | 40 | 21 | 6 | 101 | 0.616 | 229 | 2.79 | ... | 9.5 | 0.47 | -0.05 | 83.33 | 16.3 | 0.911 | NaN | 0.4 | PHI | 2004 |
9 | Colorado Avalanche | 27.9 | 82 | 40 | 22 | 7 | 100 | 0.610 | 236 | 2.88 | ... | 9.8 | 0.48 | 0.02 | 83.75 | 15.4 | 0.915 | NaN | 0.2 | COL | 2004 |
10 | New Jersey Devils | 28.8 | 82 | 43 | 25 | 2 | 100 | 0.610 | 213 | 2.60 | ... | 8.8 | 0.54 | -0.06 | 85.34 | 10.7 | 0.918 | NaN | 0.6 | NJD | 2004 |
11 | Dallas Stars | 30.6 | 82 | 41 | 26 | 2 | 97 | 0.591 | 194 | 2.37 | ... | 8.8 | 0.22 | -0.01 | 85.85 | 13.7 | 0.908 | NaN | 0.5 | DAL | 2004 |
12 | Calgary Flames | 26.9 | 82 | 42 | 30 | 3 | 94 | 0.573 | 200 | 2.44 | ... | 8.9 | 0.33 | 0.04 | 84.68 | 17.0 | 0.916 | NaN | 0.6 | CGY | 2004 |
13 | Montreal Canadiens | 28.0 | 82 | 41 | 30 | 4 | 93 | 0.567 | 208 | 2.54 | ... | 9.2 | 0.23 | 0.04 | 82.48 | 12.5 | 0.918 | NaN | 0.4 | MTL | 2004 |
14 | St. Louis Blues | 30.0 | 82 | 39 | 30 | 2 | 91 | 0.555 | 191 | 2.33 | ... | 8.6 | -0.09 | -0.01 | 84.55 | 15.2 | 0.906 | NaN | 0.6 | STL | 2004 |
15 | Nashville Predators | 26.5 | 82 | 38 | 29 | 4 | 91 | 0.555 | 216 | 2.63 | ... | 9.7 | -0.05 | -0.04 | 81.77 | 16.3 | 0.907 | NaN | 0.5 | NSH | 2004 |
16 | New York Islanders | 27.7 | 82 | 38 | 29 | 4 | 91 | 0.555 | 237 | 2.89 | ... | 10.1 | 0.29 | -0.04 | 85.52 | 14.0 | 0.906 | NaN | 0.6 | NYI | 2004 |
16 rows × 27 columns
In the code below, I get all the playoff data from the 2000-2021 seasons and store the results as an array of dataframes. This is much easier than the previous step, because I can use requests and BeautifulSoup to parse the playoff statistics table for each year.
years = [i for i in range(2000, 2022)]
playoffs = []
stanley_cup_winners = []
#looping through each year
for year in years:
    #again, the NHL season was cancelled in 2005
    if year != 2005:
        #getting the url
        url = f"https://www.hockey-reference.com/playoffs/NHL_{year}.html"
        info = requests.get(url).text
        #parsing the table
        soup = BeautifulSoup(info, "html.parser")
        table = soup.find("table", id="teams")
        table_body = table.find('tbody')
        data = []
        rows = table_body.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            cols = [i.text.strip() for i in cols]
            data.append([i for i in cols if i])
        #creating the dataframe and labeling the columns
        df_table = pd.DataFrame(data, columns=['Team', 'GP', 'W', 'L', 'T', 'OW', 'OL', 'W-L%', 'G', 'GA', 'DIFF'])
        #dropping the last row since it holds column averages, which I am not interested in
        df_table = df_table[:-1]
        #converting columns that should be ints (previously strings)
        df_table[['W', 'GP', 'L', 'T', 'OW', 'OL', 'G', 'GA', 'DIFF']] = df_table[['W', 'GP', 'L', 'T', 'OW', 'OL', 'G', 'GA', 'DIFF']].astype(int)
        #converting columns that should be floats (previously strings)
        df_table[['W-L%']] = df_table[['W-L%']].astype(float)
        #saving the season's Stanley Cup winner (the team with the most wins among the playoff 16)
        lord_stanley = df_table.loc[df_table['W'] == df_table['W'].max()]['Team'].values[0]
        #appending the dataframe to the playoffs array
        playoffs.append(df_table)
        stanley_cup_winners.append(lord_stanley)
#prints the 2000 season dataframe
playoffs[0].head(20)
Team | GP | W | L | T | OW | OL | W-L% | G | GA | DIFF | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | New Jersey Devils | 23 | 16 | 7 | 0 | 1 | 1 | 0.696 | 61 | 39 | 22 |
1 | Dallas Stars | 23 | 14 | 9 | 0 | 2 | 1 | 0.609 | 52 | 46 | 6 |
2 | Colorado Avalanche | 17 | 11 | 6 | 0 | 1 | 1 | 0.647 | 43 | 32 | 11 |
3 | Philadelphia Flyers | 18 | 11 | 7 | 0 | 2 | 1 | 0.611 | 44 | 40 | 4 |
4 | Pittsburgh Penguins | 11 | 6 | 5 | 0 | 1 | 2 | 0.545 | 31 | 23 | 8 |
5 | Toronto Maple Leafs | 12 | 6 | 6 | 0 | 1 | 0 | 0.500 | 26 | 26 | 0 |
6 | Detroit Red Wings | 9 | 5 | 4 | 0 | 0 | 1 | 0.556 | 23 | 19 | 4 |
7 | San Jose Sharks | 12 | 5 | 7 | 0 | 0 | 0 | 0.417 | 27 | 37 | -10 |
8 | St. Louis Blues | 7 | 3 | 4 | 0 | 0 | 0 | 0.429 | 22 | 20 | 2 |
9 | Ottawa Senators | 6 | 2 | 4 | 0 | 0 | 1 | 0.333 | 10 | 17 | -7 |
10 | Edmonton Oilers | 5 | 1 | 4 | 0 | 0 | 0 | 0.200 | 11 | 14 | -3 |
11 | Phoenix Coyotes | 5 | 1 | 4 | 0 | 0 | 0 | 0.200 | 10 | 17 | -7 |
12 | Buffalo Sabres | 5 | 1 | 4 | 0 | 1 | 0 | 0.200 | 8 | 14 | -6 |
13 | Washington Capitals | 5 | 1 | 4 | 0 | 0 | 1 | 0.200 | 8 | 17 | -9 |
14 | Los Angeles Kings | 4 | 0 | 4 | 0 | 0 | 0 | 0.000 | 6 | 15 | -9 |
15 | Florida Panthers | 4 | 0 | 4 | 0 | 0 | 0 | 0.000 | 6 | 12 | -6 |
#printing out all the teams that have won the Stanley Cup in past years
stanley_cup_winners
['New Jersey Devils', 'Colorado Avalanche', 'Detroit Red Wings', 'New Jersey Devils', 'Tampa Bay Lightning', 'Carolina Hurricanes', 'Anaheim Ducks', 'Detroit Red Wings', 'Pittsburgh Penguins', 'Chicago Blackhawks', 'Boston Bruins', 'Los Angeles Kings', 'Chicago Blackhawks', 'Los Angeles Kings', 'Chicago Blackhawks', 'Pittsburgh Penguins', 'Pittsburgh Penguins', 'Washington Capitals', 'St. Louis Blues', 'Tampa Bay Lightning', 'Tampa Bay Lightning']
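A quick count of the list above (reproduced here so the snippet is self-contained) makes the repeat winners mentioned earlier easy to spot:

```python
from collections import Counter

# the 21 champions from the 2000-2021 seasons (no 2005 season), as printed above
stanley_cup_winners = [
    'New Jersey Devils', 'Colorado Avalanche', 'Detroit Red Wings',
    'New Jersey Devils', 'Tampa Bay Lightning', 'Carolina Hurricanes',
    'Anaheim Ducks', 'Detroit Red Wings', 'Pittsburgh Penguins',
    'Chicago Blackhawks', 'Boston Bruins', 'Los Angeles Kings',
    'Chicago Blackhawks', 'Los Angeles Kings', 'Chicago Blackhawks',
    'Pittsburgh Penguins', 'Pittsburgh Penguins', 'Washington Capitals',
    'St. Louis Blues', 'Tampa Bay Lightning', 'Tampa Bay Lightning',
]

# count championships per franchise
wins_per_team = Counter(stanley_cup_winners)
print(wins_per_team.most_common(3))
```

Only 12 distinct franchises account for these 21 championships, with Tampa Bay, Pittsburgh, and Chicago leading at three Cups each over the span.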
Currently we have an array of dataframes for both the regular season and the playoffs. To make data processing easier, we are going to combine all the dataframes into one and add in the number of wins each team had in the playoffs for the respective year. This way I no longer have to look at multiple dataframes and can focus on this single dataframe, called combined_regular.
Additionally, I chose not to include the 2020 playoffs in the data. As you may recall, this was the covid year for sports, and there were many outside factors involved, such as the playoffs being played in a different format. For example, the team that won the playoffs that season won 18 games instead of the standard 16 of previous years. There were also no fans, and all teams played in a bubble environment. For all these reasons, I thought this season did not represent typical hockey playoffs, so I chose not to include it.
#adding the playoff-wins column into each regular season dataframe, then combining all of the dataframes into one
for i in range(0, len(playoffs)):
    #index 19 corresponds to the excluded 2020 covid playoffs
    if i != 19:
        for tm in regular_season[i]['Team']:
            playoff_wins = playoffs[i].loc[playoffs[i]['Team'] == tm]['W'].values[0]
            regular_season[i].loc[regular_season[i].loc[regular_season[i]['Team'] == tm].index, 'PlayoffWins'] = playoff_wins
#concatenating all the dataframes together
combined_regular = pd.concat(regular_season)
#showing an example of the dataframe
combined_regular.head(100)
Rk | Team | AvAge | GP | W | L | OL | PTS | PTS% | GF | GF/G | ... | S% | SRS | SOS | PK% | PIM/G | SV% | PlayoffWins | WinPctLast10 | ID | year |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | St. Louis Blues | 28.7 | 82 | 51 | 19 | 1 | 114 | 0.695 | 248 | 3.02 | ... | 10.1 | 1.08 | 0.07 | 87.83 | 13.6 | 0.909 | 3.0 | 0.5 | STL | 2000 |
2 | Detroit Red Wings | 30.7 | 82 | 48 | 22 | 2 | 108 | 0.659 | 278 | 3.39 | ... | 10.6 | 0.88 | 0.05 | 85.85 | 12.1 | 0.903 | 5.0 | 0.5 | DET | 2000 |
3 | Philadelphia Flyers | 29.0 | 82 | 45 | 22 | 3 | 105 | 0.640 | 237 | 2.89 | ... | 9.5 | 0.58 | -0.13 | 86.71 | 14.9 | 0.908 | 11.0 | 0.7 | PHI | 2000 |
4 | New Jersey Devils | 27.8 | 82 | 45 | 24 | 5 | 103 | 0.628 | 251 | 3.06 | ... | 9.2 | 0.46 | -0.13 | 87.54 | 15.8 | 0.903 | 16.0 | 0.5 | NJD | 2000 |
5 | Washington Capitals | 29.0 | 82 | 44 | 24 | 2 | 102 | 0.622 | 227 | 2.77 | ... | 10.0 | 0.28 | -0.13 | 86.22 | 12.0 | 0.915 | 1.0 | 0.5 | WSH | 2000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
16 | Tampa Bay Lightning | 29.3 | 82 | 43 | 33 | 6 | 92 | 0.561 | 246 | 3.00 | ... | 9.5 | -0.18 | -0.08 | 81.59 | 11.4 | 0.887 | 1.0 | 0.5 | TBL | 2006 |
1 | Buffalo Sabres | 27.2 | 82 | 53 | 22 | 7 | 113 | 0.689 | 298 | 3.63 | ... | 12.3 | 0.64 | -0.16 | 81.35 | 14.6 | 0.906 | 9.0 | 0.7 | BUF | 2007 |
2 | Detroit Red Wings | 32.3 | 82 | 50 | 19 | 13 | 113 | 0.689 | 252 | 3.07 | ... | 9.1 | 0.72 | 0.05 | 84.56 | 12.0 | 0.905 | 10.0 | 0.5 | DET | 2007 |
3 | Nashville Predators | 27.4 | 82 | 51 | 23 | 8 | 110 | 0.671 | 266 | 3.24 | ... | 11.8 | 0.78 | 0.05 | 85.90 | 14.4 | 0.919 | 1.0 | 0.5 | NSH | 2007 |
4 | Anaheim Ducks | 28.5 | 82 | 48 | 20 | 14 | 110 | 0.671 | 254 | 3.10 | ... | 9.8 | 0.67 | 0.06 | 85.12 | 17.8 | 0.912 | 16.0 | 0.5 | ANA | 2007 |
100 rows × 27 columns
Now that we have collected all the data needed for the analysis, it is time to look at individual regular season statistics and see how they relate to a team's performance in the playoffs. The idea is that if we can identify a relation between one of these factors and playoff performance, it signals that we should look more closely at that factor, and hints that it may have a big impact on a team's playoff performance.
To start, I am going to look at each team's regular season points percentage, goalie save percentage, roster age, goals scored per game, and win percentage in its last 10 games. For each of these categories, we are going to see how it relates to playoff performance.
Brief explanation of intuition behind looking at each category:
I also wanted to look at how the average number of goals scored per game impacts playoff performance, because games are won by whoever scores the most goals. As a result, it would make intuitive sense that teams that score a lot of goals per game should also do well in the playoffs.
Finally, I thought that a team's win percentage in its last 10 games would have a big impact on its playoff performance. These last games are often seen as a tune-up before the playoffs, and it would make sense for teams that do well during this period to perform better in the playoffs.
It should be noted that, going into this, I do not expect a strong relation between any single category and playoff performance. If there were one, I think it would be talked about a lot more, and it does not make intuitive sense for a single statistic to shed that much insight into a sport as complex as hockey.
This function plots the relationship between the category passed as a parameter (PTS%, SV%, GF/G, AvAge, or WinPctLast10) and the number of playoff wins for each team over the years. It also uses statsmodels to compute information such as the p-value and r-squared for the linear regression.
We conduct hypothesis testing with statsmodels to see whether any of these individual factors is significant enough (p-value less than 0.05) to say that playoff performance is not truly random. However, it should be noted that even a p-value below 0.05 may not give much insight into why the relationship exists, and would not mean the project is done.
def playoff_performance(category, combined_regular_df, plot_title_name):
    #defining a linear regression
    lm = linear_model.LinearRegression()
    #dropping rows with missing playoff wins; in particular, all of the 2020 covid
    #season's playoff data is NaN because I did not record it, for the reasons specified above
    combined_regular_df = combined_regular_df.dropna(subset=['PlayoffWins'])
    #x_data represents the category we are looking at
    x_data = combined_regular_df[category].values.reshape(-1, 1)
    #going to be predicting on the number of playoff wins
    y_data = combined_regular_df['PlayoffWins']
    #fitting the model to the data
    model = lm.fit(x_data, y_data)
    #plotting the data as well as adding the regression line
    plt.figure(figsize=(10, 10))
    plt.scatter(combined_regular_df[category], combined_regular_df['PlayoffWins'], color='blue')
    plt.plot(x_data, model.predict(x_data), color='darkgoldenrod')
    plt.title(f'Regular Season {plot_title_name} vs Number of Wins in Playoffs', fontsize=18)
    plt.ylabel('Number of Wins in Playoffs', fontsize=12)
    plt.xlabel(f'Regular Season {plot_title_name}', fontsize=12)
    plt.show()
    #printing out the regression results from statsmodels
    new_x = sm.add_constant(x_data)
    regression = sm.OLS(y_data, new_x).fit()
    print(f"test: {regression.summary()}")
First looking at how regular season points percentage relates to number of wins in the playoffs.
playoff_performance('PTS%', combined_regular, 'PTS%')
test:                            OLS Regression Results
==============================================================================
Dep. Variable:            PlayoffWins   R-squared:                       0.062
Model:                            OLS   Adj. R-squared:                  0.059
Method:                 Least Squares   F-statistic:                     20.85
Date:                Sun, 15 May 2022   Prob (F-statistic):           7.10e-06
Time:                        22:36:38   Log-Likelihood:                -937.98
No. Observations:                 320   AIC:                             1880.
Df Residuals:                     318   BIC:                             1887.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -9.2719      3.230     -2.871      0.004     -15.627      -2.917
x1            23.7697      5.205      4.566      0.000      13.528      34.011
==============================================================================
Omnibus:                       32.922   Durbin-Watson:                   2.028
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               41.246
Skew:                           0.876   Prob(JB):                     1.11e-09
Kurtosis:                       2.851   Cond. No.                         28.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The first thing we see is that all the data appears to be clustered between the .55 and .65 regular season points percentage range. This makes sense, because these teams were all good enough to make the playoffs. Additionally, at first glance there does not seem to be any relation between regular season points percentage and the number of playoff wins. The regression line indicates a relation, but the data does not appear to fit it very well.
However, the regression results tell a different story. The p-value is essentially 0, well below 0.05, which indicates that the relation between regular season points percentage and the number of playoff wins is very unlikely to be due to random chance. So, technically, we can reject the null hypothesis that the hockey playoffs are random and accept the alternative hypothesis. However, I would not feel comfortable stopping here. Although there is a relation, it does not look strong, and the plot does not look very linear. Additionally, the r-squared is about 0.062, meaning regular season points percentage explains only 6.2% of the variation in the number of playoff wins. So, while the two are related and one could say the NHL playoffs are not technically truly random, I do not think this model says much more than that they are very close to random.
As a result, we will continue with the other regular season statistics to get additional insight.
Now looking at how a goalie's regular season save percentage impacts the team's number of wins in the playoffs.
playoff_performance('SV%', combined_regular, 'Save Percentage')
test:                            OLS Regression Results
==============================================================================
Dep. Variable:            PlayoffWins   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.006
Method:                 Least Squares   F-statistic:                     2.947
Date:                Sun, 15 May 2022   Prob (F-statistic):             0.0870
Time:                        22:36:38   Log-Likelihood:                -946.66
No. Observations:                 320   AIC:                             1897.
Df Residuals:                     318   BIC:                             1905.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -45.1798     29.484     -1.532      0.126    -103.188      12.829
x1            55.5822     32.379      1.717      0.087      -8.122     119.286
==============================================================================
Omnibus:                       37.715   Durbin-Watson:                   1.989
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               49.067
Skew:                           0.957   Prob(JB):                     2.21e-11
Kurtosis:                       2.858   Cond. No.                         227.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This plot looks even worse than the previous one and has a p-value of .087, so we can say that save percentage alone does not have much impact on a team's playoff wins. It should also be noted that most of the data falls between a .90 and .92 save percentage. This is considered good in the NHL, and many goalies post numbers in this range. So I was not expecting this alone to be a big factor in determining playoff wins, but it could matter when combined with additional factors: for example, a team that has a great goalie and also scores a lot of goals per game.
Now going to see how a team's average roster age impacts the number of playoff wins.
playoff_performance('AvAge', combined_regular, 'Average Roster Age')
test:                            OLS Regression Results
==============================================================================
Dep. Variable:            PlayoffWins   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.5573
Date:                Sun, 15 May 2022   Prob (F-statistic):              0.456
Time:                        22:36:38   Log-Likelihood:                -947.86
No. Observations:                 320   AIC:                             1900.
Df Residuals:                     318   BIC:                             1907.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.8239      6.177      0.133      0.894     -11.330      12.978
x1             0.1636      0.219      0.746      0.456      -0.268       0.595
==============================================================================
Omnibus:                       36.582   Durbin-Watson:                   1.975
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               46.967
Skew:                           0.933   Prob(JB):                     6.33e-11
Kurtosis:                       2.791   Cond. No.                         665.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Just as seen previously, there is not strong enough evidence to support roster age impacting the number of playoff wins. This can be seen from the graph, and from the p-value of 0.456: if age truly had no effect, we would expect to see a relationship at least this strong about 46% of the time. Additionally, the r-squared value is almost 0, meaning this model explains almost none of the variation in playoff wins.
Now going to see how a team's goals scored per game impacts the number of playoff wins.
playoff_performance('GF/G', combined_regular, 'Goals per Game')
test:                            OLS Regression Results
==============================================================================
Dep. Variable:            PlayoffWins   R-squared:                       0.010
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     3.205
Date:                Sun, 15 May 2022   Prob (F-statistic):             0.0744
Time:                        22:36:38   Log-Likelihood:                -946.53
No. Observations:                 320   AIC:                             1897.
Df Residuals:                     318   BIC:                             1905.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.7001      2.656      0.264      0.792      -4.525       5.925
x1             1.6202      0.905      1.790      0.074      -0.160       3.401
==============================================================================
Omnibus:                       36.146   Durbin-Watson:                   1.975
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               46.356
Skew:                           0.928   Prob(JB):                     8.59e-11
Kurtosis:                       2.818   Cond. No.                         33.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Again, there is not enough evidence to support the claim that goals per game alone significantly impacts the number of playoff wins, since the p-value (0.074) is greater than 0.05. Additionally, the r-squared is 0.01, meaning this model explains only about 1 percent of the variation in playoff wins. However, even though these regular season statistics may not individually predict playoff wins, putting them all together could still give some insight.
Finally, we are going to look at how a team's win percentage in the last 10 games impacts the number of playoff wins.
playoff_performance('WinPctLast10', combined_regular, 'Win Percentage Last 10 Games')
test: OLS Regression Results
==============================================================================
Dep. Variable:            PlayoffWins   R-squared:                       0.006
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     1.785
Date:                Sun, 15 May 2022   Prob (F-statistic):              0.183
Time:                        22:36:39   Log-Likelihood:                -947.24
No. Observations:                 320   AIC:                             1898.
Df Residuals:                     318   BIC:                             1906.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.1256      1.012      4.077      0.000       2.135       6.116
x1             2.3212      1.738      1.336      0.183      -1.097       5.740
==============================================================================
Omnibus:                       38.251   Durbin-Watson:                   1.959
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               49.991
Skew:                           0.966   Prob(JB):                     1.40e-11
Kurtosis:                       2.869   Cond. No.                         8.77
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Surprisingly, a team's win percentage in its last 10 games does not seem to have any significant individual impact on its playoff performance. This can again be seen from the p-value of 0.183 and the low r-squared value.
Going into this project I thought this would be one of the biggest factors in a team's playoff performance, mainly because sportscasters typically highlight it as very important and frame the playoffs as being all about which team gets "hot". Additionally, it is interesting to see that one team lost its final 10 regular season games and still went on to win the Stanley Cup (16 playoff wins).
Finally, it should be noted that the graph looks odd because there are only 11 possible values a team can have for the x-axis, and only 17 possible values for the y-axis. This causes a lot of overlapping points.
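A common way to handle this kind of overplotting (not done in the graphs above) is to add a small random "jitter" to each point before plotting, so stacked points become visible while the overall pattern is preserved. A quick sketch of the idea:

```python
import numpy as np

# Add a small random offset to heavily overlapping values so that,
# when plotted, identical points no longer sit exactly on top of
# each other. The win totals below are made up for illustration.
rng = np.random.default_rng(0)
wins = np.array([0, 0, 0, 4, 4, 16])                # overlapping y-values
jittered = wins + rng.uniform(-0.2, 0.2, wins.size)  # shift by at most 0.2
print(jittered)
```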
From the previous section, we saw that regular season statistics do not seem to have much individual impact on a team's playoff performance (number of wins). Yes, we did have sufficient evidence to reject the null hypothesis for the regular season points percentage model. However, the graph for that model was very scattered and not particularly linear, and the very low r-squared value tells us there are additional factors we must consider. So, even though we saw that the playoffs are not truly "random", that model did not inspire much confidence that the NHL playoffs are not "basically random". In the case of sporting events, "basically random" vs. "random" feels like the same thing.
So, we clearly need to look at additional factors. We could add many interaction terms to a linear regression model, but that would get unnecessarily complex. Since there are only 17 different possible values for the number of wins a team can have in the playoffs (0-16), it seems like a better idea to switch to a machine learning decision tree approach to predict the number of playoff wins.
To recap, looking at individual statistics is not good enough to predict a team's playoff performance. Also, we are no longer doing linear regression, because we now want to see how well we can predict the number of wins a team will get in the playoffs. The linear regression and plots above mainly served as a starting point to see if there were any general trends.
We will now be using machine learning to see if we can accurately predict the number of wins a team will get in the playoffs based on the many statistics collected for the regular season. This is more advanced than the previous section because we are now looking at a combination of factors instead of one and seeing how that relates to a team's playoff performance.
To start, we are going to use a random forest classifier and a decision tree classifier, with the number of playoff wins as the dependent data and various statistics collected from the regular season as the independent data. Here is a brief explanation of each statistic we are going to include for the independent data. It is important to note that all of this data comes from the regular season, and we are trying to see if it impacts playoff performance:
#dropping 2020 playoffs
combined_regular = combined_regular.dropna(subset=['PlayoffWins'])
#independent data
x_data = combined_regular[['AvAge', 'PTS%', 'PIM/G', 'S%', 'SV%','SRS', 'SOS', 'GF/G', 'GA/G', 'PP%', 'PK%', 'WinPctLast10']]
#dependent data
y_data = combined_regular['PlayoffWins']
#using both a decision tree classifier and a random forest classifier
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier()
#using cross_val_score and accuracy
scoresDTC = ms.cross_val_score(dtc,x_data,y_data,cv=5,scoring='accuracy')
scoresRFC = ms.cross_val_score(rfc,x_data,y_data,cv=5,scoring='accuracy')
#looking at accuracy, error, and standardized error of the two models
print(f"DTC: Mean Accuracy: {scoresDTC.mean()}, Error : {1 - scoresDTC.mean()}, Std: {scoresDTC.std()/ np.sqrt(5)}")
print(f"RFC: Mean Accuracy: {scoresRFC.mean()}, Error : {1 - scoresRFC.mean()}, Std: {scoresRFC.std()/ np.sqrt(5)}")
DTC: Mean Accuracy: 0.10625, Error : 0.89375, Std: 0.020916500663351885
RFC: Mean Accuracy: 0.190625, Error : 0.809375, Std: 0.014921670482891652
In the code above, we used cross_val_score() on a decision tree classifier and a random forest classifier to compute the accuracy of our two models, where the dependent data is the number of playoff wins and the independent data is the regular season statistics mentioned above. I used two classifiers because I wanted to see if there was a big difference in accuracy between them. I was only able to use a k-fold value of 5 because of the small sample size of the data. I also did not tune the hyperparameters for either model, because I just wanted a general sense of each model's accuracy.
Now, from these results we see that the decision tree classifier has an accuracy of around 10 percent and the random forest classifier has an accuracy of around 19 percent. Additionally, the low standard error for both classifiers shows low variation across the folds.
This accuracy is obviously very bad, and in general this model should not give you much confidence in predicting the number of wins a team will get in the playoffs. However, a purely random guess at the number of wins for each team would only be correct around 6 percent of the time. While this is an oversimplification, because not every team can have the same number of wins, it serves the general purpose of showing that this model is doing some learning and is not completely random.
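The roughly 6 percent figure comes from a uniform guess over the 17 possible win totals:

```python
# A uniform random guess over the 17 possible win totals (0 through 16)
# is correct 1/17 of the time, about 5.9 percent.
num_outcomes = 17
baseline = 1 / num_outcomes
print(f"Random-guess baseline accuracy: {baseline:.3f}")
```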
So, from this information, we can conduct a t-test of our model's predictions vs. random predictions and see if the results are statistically significant enough to reject the null hypothesis.
Now, we are going to conduct a t-test of our random forest classifier model vs a random model for generating the number of wins per team. It should be noted that we will use the random forest classifier because it is better from an accuracy standpoint.
rfc = RandomForestClassifier()
#redefining data for clarity purposes
x_data = combined_regular[['AvAge', 'PTS%', 'PIM/G', 'S%', 'SV%','SRS', 'SOS', 'GF/G', 'GA/G', 'PP%', 'PK%', 'WinPctLast10']]
y_data = combined_regular['PlayoffWins']
#used cross_val_predict because it gives a list of predicted values
predicted = ms.cross_val_predict(rfc,x_data,y_data,cv=5)
#random_acc list
random_acc = []
#represents accuracy of predicted list
accuracy = []
#looping through actual wins per team and predicted wins per team by model
for actual, curr_prediction in zip(y_data, predicted):
#giving a score of 1 if made correct prediction or 0 otherwise
if actual==curr_prediction:
accuracy.append(1)
else:
accuracy.append(0)
#computing the random value
random_val = random.randint(0,16)
if (random_val == actual):
random_acc.append(1)
else:
random_acc.append(0)
#conducting t test and returning p-value
result = stats.ttest_rel(accuracy,random_acc)
print(f"P-value of our model vs random model: {result.pvalue}")
P-value of our model vs random model: 5.141375877977624e-05
As we can see from running a t-test of our model vs a random one, we get a p-value of less than 0.05, which gives significant evidence to reject the null hypothesis. So, we can clearly say that our model is learning and that we have significant evidence to say hockey playoffs are not random.
It should be noted that, because of the small dataset, the p-value changes from iteration to iteration. However, it never got close to 0.05, so this is not a cause for worry. In addition, I originally considered running this test repeatedly and combining the resulting p-values using Fisher's method, but I realized this was unnecessary because the p-value was always significantly smaller than 0.05.
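For reference, had the p-values been borderline, the Fisher's-method idea mentioned above could be sketched like this; the p-values below are made up for illustration, and `scipy.stats.combine_pvalues` does the combining.

```python
from scipy import stats

# Combine p-values from several hypothetical runs of the t-test into a
# single p-value using Fisher's method. These p-values are made up.
pvalues = [5.1e-05, 8.3e-05, 2.4e-04, 1.1e-04, 6.7e-05]
statistic, combined_p = stats.combine_pvalues(pvalues, method='fisher')
print(f"Combined p-value: {combined_p}")
```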
From the above section, we had significant evidence to say that the NHL playoffs are indeed not random and that there are some outside factors that contribute to the number of wins a team will have in the playoffs. However, we also saw that our model had only around a 19 percent accuracy rate. As stated earlier, this should not inspire much confidence in our model's prediction capabilities. As of right now, I would say that if the playoffs are not completely random, then they are basically random.
After thinking about this, a possible reason for the low accuracy is that we have 17 possible outcomes for each team. For example, we are currently treating a team with 1 playoff win as completely different from a team with 2 playoff wins, even though both teams failed to make it out of the first round. So, my idea is to revise the question from predicting the number of wins a team will have to predicting what round a team will reach in the playoffs. This should make for better prediction capabilities and be more useful in general.
To recall, the playoffs currently work as a 16-team bracket with each round being a best-of-7 series. If this is confusing, it may help to know that it works the same way as the Sweet 16 in March Madness, except each round is a best-of-7 series instead of a single game. Here is a breakdown of the round a team will reach based on their number of playoff wins.
Note, there is not actually a round 5, I just set it up this way so that the Stanley Cup Winner has a special label.
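The wins-to-round mapping follows directly from the best-of-7 structure: each round won contributes exactly 4 wins, so integer division recovers the round a team reached.

```python
# Each playoff round won contributes exactly 4 wins, so integer-dividing
# a team's playoff win total by 4 gives the number of completed rounds,
# and adding 1 gives the round it reached (16 wins maps to the special
# "round 5" champion label).
def playoff_round(wins):
    return wins // 4 + 1

# 0-3 wins -> round 1, 4-7 -> round 2, ..., 16 -> round 5 (champion)
print([playoff_round(w) for w in [0, 3, 4, 7, 8, 12, 15, 16]])
```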
Now, we can do the same code done in the previous section but make the dependent variable the playoff round instead of number of playoff wins.
#defining what round a team will make it into playoffs based on number of playoff wins
combined_regular['PlayoffRound'] = (combined_regular['PlayoffWins']//4)+1
#getting same data as before
x_data = combined_regular[['AvAge', 'PTS%', 'PIM/G', 'S%', 'SV%','SRS', 'SOS', 'GF/G', 'GA/G', 'PP%', 'PK%', 'WinPctLast10']]
y_data = combined_regular['PlayoffRound']
#defining classifiers
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier()
#using cross val score. This time we can do 10-fold cross validation since there will be more data inside each bucket because there are fewer overall buckets
scoresDTC = ms.cross_val_score(dtc,x_data,y_data,cv=10,scoring='accuracy')
scoresRFC = ms.cross_val_score(rfc,x_data,y_data,cv=10,scoring='accuracy')
#calculating accuracy
print(f"DTC: Mean Accuracy: {scoresDTC.mean()}, Error : {1 - scoresDTC.mean()}, Std Error: {scoresDTC.std()/ np.sqrt(10)}")
print(f"RFC: Mean Accuracy: {scoresRFC.mean()}, Error : {1 - scoresRFC.mean()}, Std Error: {scoresRFC.std()/ np.sqrt(10)}")
DTC: Mean Accuracy: 0.328125, Error : 0.671875, Std Error: 0.014823176532039278
RFC: Mean Accuracy: 0.4625, Error : 0.5375, Std Error: 0.02891258376555094
It looks like predicting the round instead of the number of wins resulted in better accuracy for both models. Previously, the decision tree classifier gave an accuracy of around 10 percent, but it now gives around 33 percent. Similarly, the random forest classifier went from around 19 percent to around 46 percent. This is overall great news. We also still see low standard error for both, which shows low variation across the folds. Additionally, it is important to note that I did not tune the hyperparameters for the classifiers because I wanted a general sense of their accuracy. When testing, I played around with the min_samples_leaf parameter and saw that setting it to 30+ gave an accuracy of around 50 percent. However, it also made the model predict every team as finishing in either round 1 or round 2. The fact that every team can get the same round prediction is a major flaw in the project, which will be explained more later.
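The min_samples_leaf experiment described above can be sketched as a simple sweep. The data here is synthetic, standing in for the real regular season statistics, so the exact accuracies will differ from the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 320 team-seasons, 12 statistics, 5 round labels.
X, y = make_classification(n_samples=320, n_features=12, n_informative=5,
                           n_classes=5, random_state=0)

# Sweep min_samples_leaf; larger values force coarser trees, which can
# raise accuracy while collapsing predictions onto the common classes.
for leaf in [1, 5, 15, 30]:
    rfc = RandomForestClassifier(min_samples_leaf=leaf, random_state=0)
    scores = cross_val_score(rfc, X, y, cv=10, scoring='accuracy')
    print(f"min_samples_leaf={leaf}: mean accuracy {scores.mean():.3f}")
```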
Overall, I think it should still be noted that these accuracies are not good. They are clearly better than random, but I still do not think these results should inspire much confidence. Regardless, I will still conduct a t-test of the new model against a random model to see if the new model is significantly better than random.
Now, we are going to conduct a t-test of our random forest classifier model vs a random model for generating the playoff round per team. It should be noted that we will use the random forest classifier because it is better from an accuracy standpoint.
# same as the code in previous section
combined_regular['PlayoffRound'] = (combined_regular['PlayoffWins']//4)+1
x_data = combined_regular[['AvAge', 'PTS%', 'PIM/G', 'S%', 'SV%','SRS', 'SOS', 'GF/G', 'GA/G', 'PP%', 'PK%', 'WinPctLast10']]
y_data = combined_regular['PlayoffRound']
rfc = RandomForestClassifier()
#again using cross_val_predict because it gives the prediction for all values
predicted = ms.cross_val_predict(rfc,x_data,y_data,cv=10)
random_acc = []
accuracy = []
#computing accuracy of our model and accuracy of random model.
for actual, curr_prediction in zip(y_data, predicted):
if actual==curr_prediction:
accuracy.append(1)
else:
accuracy.append(0)
#now random value can be between 1 and 5 for round
random_val = random.randint(1,5)
if (random_val == actual):
random_acc.append(1)
else:
random_acc.append(0)
#conducting t test and returning p-value
result = stats.ttest_rel(accuracy,random_acc)
print(f"P-value of our model vs random model: {result.pvalue}")
P-value of our model vs random model: 1.436309807381069e-10
As we can see from running a t-test of our model vs. a random one, we get a p-value of less than 0.05, which gives significant evidence to reject the null hypothesis. So, we can again say that our model is learning and that we have significant evidence that the hockey playoffs are not random, just as we saw in the previous test.
Before giving the conclusion to this project, I thought it would be fun to use this model to predict the winner of the 2022 playoffs, which are currently happening. Unfortunately, the Capitals are out, so the playoffs are no longer fun to watch, but the prediction could still be interesting.
Also, it is important to note that the prediction we are about to do should not be taken very seriously, because the model has no way of knowing that only a certain number of teams can be in each round. Specifically, at the end of the playoffs there will be exactly 8 teams that ended in round 1, 4 teams in round 2, 2 teams in round 3, 1 team in round 4 (the runner-up), and 1 team in round 5 (the winner). I tried to investigate creating a bracket-style classifier, but this was very complex and out of the scope of this project/class. To try and work around this limitation, I will run the prediction many times and see what round each team gets placed in the most. I will also be using the decision tree classifier, because the random forest classifier could not predict a team reaching a round above 3. This makes sense, because every season 14 of 16 teams are placed in the first 3 rounds, which is about 88 percent of the data. As stated before, the classifiers have no way of knowing this, which causes the issue of not being able to predict the higher rounds when generalizing the data. The decision tree classifier gave worse accuracy but generated round predictions across all rounds, so we are going to use that. Again, this is just for fun and not really the point of the project; the purpose was to see if the playoffs are random, which we already saw is not completely true.
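"The round each team gets placed in the most" is the modal prediction across runs. The code that follows approximates this by rounding the mean of the runs; taking the true mode directly could be sketched like this, with made-up predictions for illustration.

```python
from collections import Counter

# Made-up predictions: 4 runs of the classifier over 3 teams.
runs = [[1, 2, 1],
        [1, 2, 2],
        [1, 3, 2],
        [1, 2, 2]]

# Group the per-run predictions by team, then take each team's most
# common predicted round.
per_team = list(zip(*runs))
modal = [Counter(preds).most_common(1)[0][0] for preds in per_team]
print(modal)
```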
prediction_results = []
num_times = 100
#running prediction 100 times
for i in range(0, num_times):
    x_data = combined_regular[['AvAge', 'PTS%', 'PIM/G', 'S%', 'SV%','SRS', 'SOS', 'GF/G', 'GA/G', 'PP%', 'PK%', 'WinPctLast10']]
    y_data = combined_regular['PlayoffRound']
    dtc = DecisionTreeClassifier()
    #fitting previous years' data
    dtc.fit(x_data, y_data)
    #2022 regular season data
    x_data_curr = current_regular_season[0][['AvAge', 'PTS%', 'PIM/G', 'S%', 'SV%','SRS', 'SOS', 'GF/G', 'GA/G', 'PP%', 'PK%', 'WinPctLast10']]
    prediction_results.append(dtc.predict(x_data_curr))
#computing sum of all the results for each team
col_totals = [ sum(x) for x in zip(*prediction_results) ]
#dividing each column by num_times and rounding it to get true round prediction
final_pred = [round(x / num_times) for x in col_totals]
#associating prediction results with team names and displaying them as a dataframe:
current_regular_season[0]['RoundPrediction'] = final_pred
prediction_df = current_regular_season[0][['Team','RoundPrediction']]
#sorting rankings
prediction_df = prediction_df.sort_values(by=['RoundPrediction'], ascending=False)
prediction_df.head(100)
Rk | Team | RoundPrediction |
---|---|---|
15 | Nashville Predators | 5 |
1 | Florida Panthers | 4 |
5 | Minnesota Wild | 4 |
8 | Tampa Bay Lightning | 4 |
2 | Colorado Avalanche | 3 |
3 | Carolina Hurricanes | 2 |
4 | Toronto Maple Leafs | 2 |
9 | New York Rangers | 2 |
10 | Boston Bruins | 2 |
12 | Pittsburgh Penguins | 2 |
16 | Dallas Stars | 2 |
6 | Calgary Flames | 1 |
7 | St. Louis Blues | 1 |
11 | Edmonton Oilers | 1 |
13 | Washington Capitals | 1 |
14 | Los Angeles Kings | 1 |
My model says the Nashville Predators will win the Stanley Cup this year. Unfortunately, this is not possible, since they just lost in the first round 4 games to 0. We can also see that the Panthers, Wild, and Lightning have the next best odds to win. Of these teams, I think the Panthers will win, because they unfortunately beat the Capitals in the first round. Interestingly, my model predicted the Capitals and Kings to be first-round exits, which sadly just happened in real life.
Again, it should be noted that the model does not know the playoff structure, which is why it says multiple teams can end up in round 4, when only one can (the runner-up). Figuring out this bracket structure is out of the scope of this project. Additionally, the model was given every category with no prior weighting, so something like penalty minutes per game could influence the prediction as much as points percentage. In reality, points percentage is probably more important to a team's playoff success than penalty minutes per game. However, figuring out how to weight individual regular season statistics is very complex and out of the scope of this project. This also possibly explains why the model said the Nashville Predators would be Stanley Cup Champions this season, as they had the most penalty minutes per game. So, this could highlight a trend in previous years where teams with a lot of penalties perform well in the playoffs. Looking into this is out of the scope of the project, but it could be interesting to explore in future research.
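One way to check how a fitted tree actually used the statistics is its feature_importances_ attribute, which reports how much each column drove the splits, for example whether PIM/G mattered as much as PTS%. A sketch on synthetic data, with feature names mirroring the columns used above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Same column names as the real independent data; the data itself is
# synthetic, so the importances below are only illustrative.
features = ['AvAge', 'PTS%', 'PIM/G', 'S%', 'SV%', 'SRS', 'SOS',
            'GF/G', 'GA/G', 'PP%', 'PK%', 'WinPctLast10']
X, y = make_classification(n_samples=320, n_features=len(features),
                           n_informative=6, n_classes=5, random_state=0)
dtc = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances sum to 1; higher means the feature drove more splits.
for name, imp in sorted(zip(features, dtc.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.3f}")
```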
The main thing we saw from all this data analysis is that we had significant evidence to reject the null hypothesis that the NHL playoffs are random, in favor of the alternative hypothesis that outside factors help predict how well a team will do in the playoffs. However, we also saw that our best model was only able to predict what round a team will reach around 46 percent of the time. This is not very good, and it is not a model that should be used for betting purposes or even for fun. Additionally, my results may not be fully valid, because the models did not account for the NHL bracket structure and gave every regular season statistic equal footing (for example, a team's average age and its points percentage entered the model on equal terms). Accounting for these things added complexities that were out of the scope of the project. Also, when I did the t-tests, I gave equal likelihood to every possible number of wins and every possible round. This was done after a recommendation from the course staff, but it is not truly accurate, because at the end of the playoffs there will only be 8 teams that end in round 1, 4 in round 2, 2 in round 3, 1 in round 4 (the runner-up), and 1 in round 5 (the winner). So, this could mean my results are not statistically significant. Again, this is a bit out of the scope of this project, but my results should be taken with a grain of salt.
My viewpoint is that the hockey playoffs appear not to be completely random, but they are basically random, and anything can happen. I also think the difference between completely random and basically random is not a big one for sports fans. However, the NHL being unpredictable is not a bad thing. The NHL playoffs, and sports in general, are so great to watch precisely because anything can happen. If we could accurately predict the winner every time, there would be no point for the athletes to play the game. Randomness is what makes sports exciting.
Overall, this project was very fun to do and gave a lot of insight into data science and the randomness of the NHL. In the future, I can look to make a model that considers the NHL bracket structure and look at additional possible factors that may impact a team's performance in the playoffs. This tutorial mostly served as a starting point that can be explored further with additional research. Thank you for reading, and I hope this tutorial gave you a bit of insight into the randomness of the NHL and maybe some data science techniques.