Are Hockey Playoffs Actually Random?

CMSC320 Final Project by Matt Blodgett

Introduction

The NHL playoffs are often regarded as one of the most random events in sports. By random, I mean that the expected winner rarely wins the playoffs. For example, the team that won the Presidents' Trophy (awarded to the team with the most points in the regular season) has won the playoffs only 8 times in the 35 years the award has been given out. The last team to win both the Presidents' Trophy and the Stanley Cup was the Chicago Blackhawks in the 2012-2013 season.

There are many possible explanations for this randomness. Braden Holtby, an NHL goalie, attributes it to the game being played and officiated differently in the playoffs. He said that referees call fewer penalties and let more things go than in the regular season, because they do not want to influence the outcome and want the true best team to win. While this motivation is respectable, this officiating policy may give an advantage to more physical teams. Additionally, the NHL has an 82-game regular season schedule. Because of the physicality of the sport, teams cannot realistically perform their best in every regular season game. This changes in the playoffs, where any game could be a team's last. Furthermore, since each round of the playoffs is a best-of-7 series, teams have more opportunity to scout their opponents. This happens less in the regular season, when the schedule is densely packed and teams face a different opponent every couple of days.

Finally, there is inherent "puck luck" in hockey because the puck bounces around a lot. A great recent example came in the first round of the 2022 NHL Playoffs. The Washington Capitals led the best-of-7 series 2 games to 1 against the top-seeded Florida Panthers and led Game 4 2-1 with 2 minutes remaining. The Panthers pulled their goalie for an extra skater, and Capitals player Garnet Hathaway barely missed an empty-net goal. Immediately afterward, the Panthers scored, won the game in overtime, and went on to win the series 4 games to 2. This highlights the "randomness": had Hathaway scored, the Capitals would have led the series 3-1 and would have been very likely to win it.

All these factors contribute to the NHL playoffs being very random. However, when I went to pick a final project, I could not help but notice that there are repeat winners in the NHL playoffs. These include the back-to-back Stanley Cup winners: the Tampa Bay Lightning (2020 and 2021) and the Pittsburgh Penguins (2016 and 2017). I believe that there could be a reason for this and that this reason could potentially be generalized to all winners.

So, for my final project, I wanted to see whether the NHL playoffs really are random or whether there is some relation between how well teams do in the regular season and how well they do in the playoffs. To start, I take as the null hypothesis that the NHL playoffs are random; the alternative hypothesis is that there is a way to predict how well a team will do in the playoffs.

Data Collection + Processing

Before getting started, here is the list of required Python libraries:

  1. pandas
  2. Requests
  3. Beautiful Soup
  4. NumPy
  5. scikit-learn (sklearn)
  6. statsmodels
  7. Matplotlib
  8. SciPy
  9. random (Python standard library)

For this project, all the data was collected from https://www.hockey-reference.com. This website contains the regular season and playoff data for each team, as well as each team's individual game history. The regular season table for each year lists every team's regular season wins, losses, points, goals scored, goalie save percentage, and so on. Recall that the goal is to see whether any of these factors, or some combination of them, has an impact on a team's playoff performance. The playoff data for each year contains the same statistics as the regular season table, but for the playoffs. However, I only care about the number of wins a team has in the playoffs, since I am trying to see how regular season performance relates to the number of playoff wins. As a quick recap, the NHL playoffs currently work as a 16-team bracket where each round is a best-of-7 series. The round in which a team's run ended can be read off its playoff win total:

  1. Round 1: 0-3 wins
  2. Round 2: 4-7 wins
  3. Round 3: 8-11 wins
  4. Round 4: 12-15 wins
  5. Winner: 16 wins

So, if a team wins 16 games in the playoffs, then they win the Stanley Cup.
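The mapping above can be written as a small helper. This is a minimal sketch; the function name wins_to_round is my own and is not taken from the original notebook.

```python
def wins_to_round(playoff_wins: int) -> int:
    """Map a playoff win total (0-16) to the round label used in this tutorial.

    Labels 1-4 are the round in which the team's run ended; 5 marks the winner.
    """
    if not 0 <= playoff_wins <= 16:
        raise ValueError("playoff wins must be between 0 and 16")
    # Every series won adds 4 wins, so integer division counts series won.
    return playoff_wins // 4 + 1


for wins in (0, 3, 4, 11, 15, 16):
    print(wins, "->", wins_to_round(wins))
```

Integer division works here only because every round is a best-of-7 series, so a season with a different format would need its own mapping.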

Here are examples of the 2018 regular season statistics table, the 2018 playoff data, and the Washington Capitals team data:

  1. 2018 Regular Season Team Statistics Table: https://www.hockey-reference.com/leagues/NHL_2018.html#stats
  2. 2018 Playoff Data: https://www.hockey-reference.com/playoffs/NHL_2018.html#teams
  3. 2018 Washington Capitals Data: https://www.hockey-reference.com/teams/WSH/2018_games.html#games

Disclaimer:

Before scraping the data, it should be noted that I was unable to get the regular season statistics table using traditional web scraping methods (pandas read_html and Beautiful Soup). This was because the Team Statistics table on the website is improperly formatted. Fortunately, the website offers the option of manually downloading the table as an Excel file, which you can then convert to a CSV file and read in with pandas. The steps are:

  1. Go to https://www.hockey-reference.com/leagues/NHL_2018.html#stats (or the page for the year you want regular season data from)
  2. Scroll to the Team Statistics table
  3. Click Share & Export
  4. Choose Get as Excel Workbook
  5. Once downloaded, convert the file to CSV

These steps can be repeated for each year you want to look at, with all the files placed in one directory that can then be read in with pandas read_csv. Unfortunately, I cannot host the data on a GitHub page or anywhere else because this violates the website's terms of use. All the other data I wanted from the website could easily be read in using BeautifulSoup or pandas read_html. I chose to download the data manually because, after many hours of trying to read in the table using various data science techniques, it still would not work, and downloading it was easier. Additionally, the professor said I was allowed to do this.

Getting Regular Season data from 2000-2022

The code below gets the regular season data for each team from 2000 to 2022. It is important to note that I had to download all of the statistics tables into a directory, from which they were read in with the pandas read_csv function. At the end, this code generates an array of dataframes for the regular season data from 2000-2022. I also saved data from the current season but stored it separately, since that season is still being played.
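As a rough sketch of this loading step, the helper below reads one CSV per season from a directory. The file naming scheme (NHL_2018.csv and so on), the function name, and the added Season column are assumptions of mine, not details from the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_regular_seasons(data_dir, first_year=2000, last_year=2022):
    """Read one CSV per season and return a list of DataFrames.

    Assumes (hypothetically) that files are named like NHL_2018.csv.
    Years with no file in the directory are skipped.
    """
    frames = []
    for year in range(first_year, last_year + 1):
        path = Path(data_dir) / f"NHL_{year}.csv"
        if not path.exists():
            continue
        season = pd.read_csv(path)
        season["Season"] = year  # remember which season each row came from
        frames.append(season)
    return frames
```

If the exported file carries an extra grouping row above the column names, passing header=1 to read_csv may be needed; that depends on the export and is worth checking on one file first.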

Getting Playoff Data from 2000-2022

In the code below, I get all the playoff data from the 2000-2021 seasons and store the results as an array of dataframes. This is much easier than the previous step because I can use requests and BeautifulSoup to parse the playoff statistics table for each year.
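A sketch of how such a table could be fetched and parsed is shown here. The function names are hypothetical, and the table id "teams" is an assumption taken from the #teams fragment in the playoff URL listed earlier, so it should be checked against the page source.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def parse_table(html, table_id):
    """Return the <table id=table_id> found in html as a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"no table with id {table_id!r}")
    # Use the last header row in case the table has a grouping row above it.
    header_row = table.find("thead").find_all("tr")[-1]
    header = [th.get_text(strip=True) for th in header_row.find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if len(cells) == len(header):  # skip spacer or repeated-header rows
            rows.append(cells)
    return pd.DataFrame(rows, columns=header)


def fetch_playoff_table(year, table_id="teams"):
    """Download one season's playoff page and parse its team table (assumed id)."""
    url = f"https://www.hockey-reference.com/playoffs/NHL_{year}.html"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return parse_table(response.text, table_id)
```

Every value comes back as a string, so numeric columns such as wins would still need converting (for example with pd.to_numeric) before analysis.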

Combining Dataframes Together

Currently we have an array of dataframes for both the regular season and the playoffs. To make data processing easier, we are going to combine all the dataframes into one and add the number of wins each team had in the playoffs for the respective year. This way I no longer have to look at multiple dataframes and can focus on a single dataframe called combined_regular.

Additionally, I chose not to include the 2020 playoff season in the data. As you may recall, this was the COVID year for sports, and many outside factors affected the playoffs, including a different format. For example, the team that won the playoffs that season won 18 games instead of the standard 16 of previous years. Also, there were no fans, and all teams played in a bubble environment. For all these reasons, I thought this season did not represent the typical hockey playoffs, so I chose not to include it.
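The combining step could look roughly like the sketch below. The column names Team, Season, and W, the PlayoffWins name, and the use of an inner merge (keeping only teams that made the playoffs) are assumptions for illustration.

```python
import pandas as pd


def combine_seasons(regular_frames, playoff_frames, skip_seasons=(2020,)):
    """Stack the per-season tables and attach each team's playoff wins."""
    regular = pd.concat(regular_frames, ignore_index=True)
    playoffs = pd.concat(playoff_frames, ignore_index=True)
    # Keep only the columns needed from the playoff tables.
    playoffs = playoffs.rename(columns={"W": "PlayoffWins"})
    playoffs = playoffs[["Team", "Season", "PlayoffWins"]]
    # The inner merge drops teams that did not make the playoffs that year.
    combined = regular.merge(playoffs, on=["Team", "Season"], how="inner")
    keep = ~combined["Season"].isin(skip_seasons)
    return combined[keep].reset_index(drop=True)
```

Team names have to match exactly between the two tables for the merge to find them, so any markers or stray whitespace in the names would need to be cleaned first.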

Exploratory Data Analysis

Now that we have collected all the data needed for the analysis, it is time to look at some individual regular season statistics and see how they relate to a team's performance in the playoffs. The goal is that if we can identify a relation between one of these factors and playoff performance, it signals that we should look closer at that factor and hints that it may have a big impact on a team's playoff performance.

To start, I am going to look at each team's regular season points percentage, goalie save percentage, roster age, goals scored per game, and win rate in the last 10 games. For each of these categories, we are going to see how it relates to playoff performance.

Brief explanation of intuition behind looking at each category:

  1. Regular season points percentage (the share of available points a team earned) would seem to be the most important factor in a team's playoff success. But this may not be true, as seen with the Presidents' Trophy winners and their lack of success in the playoffs.
  2. Goalie save percentage is also seen as very important because critics often like to highlight that you need a great goalie to win in the playoffs.
  3. I thought it would be interesting to see if a team's average age had any impact on playoff performance. If it did, in either direction, then it could signal what teams should do with regard to age when building rosters.
  4. I also wanted to look at how the average number of goals scored per game impacts playoff performance because games are won by whoever scores the most. As a result, it would make intuitive sense that teams that score a lot of goals per game should also do well in the playoffs.
  5. Finally, I thought that a team's win percentage in the last 10 games would have a big impact on its playoff performance. This is because these last games are often seen as a tune-up before the playoffs, and it would make sense that teams that do well during this period would perform better in the playoffs.

It should be noted that going into this, I do not expect a strong relation between any single category and playoff performance. If there were one, I think it would be talked about a lot more, and it does not make intuitive sense for a single statistic to shed much light on a sport as complex as hockey.

Creating Playoff Performance Function:

This function plots the relationship between the category passed as the parameter (one of PTS%, SV%, GF/G, AvAge, WinPCTLast10) and the number of playoff wins for each team over the years. It also uses statsmodels to compute information such as the p-value and r-squared for the linear regression.

The reason we conduct hypothesis testing using statsmodels is to see if any of these individual factors is significant enough (p-value less than 0.05) to say that playoff performance is not truly random. However, it should be noted that even if we find a p-value less than 0.05, it may not give a lot of insight into why this is happening and does not mean that the project is done.
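To make the regression step concrete, here is a minimal sketch that returns the slope, p-value, and r-squared for one category. It swaps in scipy.stats.linregress in place of statsmodels to stay short (for a single predictor the slope p-value is the same two-sided test), it leaves out the plotting, and the function name, column names, and demo numbers are all hypothetical.

```python
import pandas as pd
from scipy import stats


def regression_summary(df, category, target="PlayoffWins"):
    """Fit target ~ category by ordinary least squares and summarize the fit."""
    result = stats.linregress(df[category].astype(float), df[target].astype(float))
    return {
        "slope": result.slope,
        "intercept": result.intercept,
        "p_value": result.pvalue,          # two-sided test that the slope is zero
        "r_squared": result.rvalue ** 2,   # share of variance explained
    }


# Made-up numbers, only to show the call; these are not real team statistics.
demo = pd.DataFrame(
    {"PTS%": [0.55, 0.58, 0.60, 0.62, 0.66, 0.70],
     "PlayoffWins": [2, 7, 1, 16, 5, 9]}
)
print(regression_summary(demo, "PTS%"))
```

Reading the two numbers together matters: the p-value says whether a relation is distinguishable from noise, while r-squared says how much of the variation in playoff wins the category actually accounts for.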

Now we can look at the plot for each individual category, as well as the regression results:

PTS% vs Playoff Wins:

First, we look at how regular season points percentage relates to the number of wins in the playoffs.

PTS% VS PlayoffWins Explanation:

The first thing we see is that most of the data is clustered in the .55 to .65 regular season points percentage range. This makes sense because these teams were all good enough to make the playoffs. At first glance, there does not seem to be any relation between regular season points percentage and number of playoff wins. The regression line indicates there is a relation, but the points do not follow it closely.

However, the regression results tell a different story. The reported p-value is approximately 0, well below 0.05. This indicates that the relation between regular season points percentage and number of playoff wins is very unlikely to be due to random chance. So, from this we can technically reject the null hypothesis that hockey playoffs are random in favor of the alternative hypothesis. However, I would not feel comfortable doing this at this point. Although there is a relation, it does not look very strong, and the plot does not look linear. Additionally, the r-squared value is around 0.062. This means that regular season points percentage explains only about 6.2% of the variation in the number of playoff wins. So, while the two are related and you could say that the NHL playoffs are not technically truly random, I don't think this model says much more than that they are probably very close to being random at this point.

As a result, we will continue with the other regular season statistics to get additional insight.

Goalie Save Percentage vs Playoff Wins:

Now we look at how goalie save percentage in the regular season impacts a team's number of wins in the playoffs.

Goalie Save Percentage vs Playoff Wins Explanation:

This plot looks even worse than the previous one and has a p-value of .09, so we can say that save percentage alone does not have much impact on a team's playoff wins. It should also be noted that a lot of the data falls between .90 and .92 save percentage. This is considered good in the NHL, and many goalies have a save percentage in this range. So, I was not expecting this alone to be a big factor in determining the number of playoff wins, but it could be when combined with additional factors, for example a team that has a great goalie and also scores a lot of goals per game.

Average Team Age vs Playoff Wins:

Now we are going to see how a team's average age impacts its number of playoff wins.

Average Team Age vs Playoff Wins Explanation:

Just as before, there is not strong enough evidence that roster age impacts the number of playoff wins. This can be seen from the graph and from the p-value of about 0.46: if age truly had no effect, we would expect to see a relationship at least this strong about 46% of the time. Additionally, the r-squared value is almost 0, which means the model explains almost none of the variation in playoff wins.

Goals per Game vs Playoff Wins:

Now we are going to see how a team's goals per game impacts its number of playoff wins.

Goals per Game vs Playoff Wins Explanation:

Again, there is not enough evidence to support the claim that goals per game alone significantly impacts the number of playoff wins for each team, because the p-value is greater than 0.05. Additionally, the r-squared is .01, which means this model explains only about 1 percent of the variation in playoff wins. However, although these regular season statistics may not individually have much impact on the number of playoff wins, putting them all together could give some insight.

Win Percentage in Last 10 Games vs Playoff Wins:

Finally, we are going to look at how a team's win percentage in the last 10 games impacts the number of playoff wins.

Win Percentage in Last 10 Games vs Playoff Wins Explanation:

This is surprising, but a team's win percentage in its last 10 games does not seem to have any significant individual impact on its playoff performance. This can again be seen from the p-value of .183 and the low r-squared value.

Going into this project, I thought this would be one of the biggest influences on a team's performance in the playoffs. This is mainly because hockey sportscasters typically highlight it as being very important and say that the playoffs are all about which team gets "hot". Additionally, it is interesting to see that one team lost its final 10 games and went on to win the Stanley Cup (16 wins).

Finally, it should be noted that the graph looks odd because there are only 11 possible values a team can have for the x-axis, and only 17 possible values for the y-axis. This causes a lot of overlapping points.

Where Do We Go From Here

From the previous section, we saw that regular season statistics do not seem to have that much individual impact on a team's performance in playoffs (number of wins). Yes, we did see that we had sufficient evidence to reject the null hypothesis based on the regular season points percentage model. However, when we looked at the graph, the data appeared to be very scattered and not linear. Additionally, we got a very low r-squared value which tells us that there are additional factors we must consider. So basically, even though we saw that the playoffs are not truly "random", that model did not inspire much confidence that the NHL playoffs were not "basically random". In the case of sporting events "basically random" vs "random" feels like the same thing.

So, we clearly see that we need to look at additional factors. We could add many interaction terms to a linear regression model and see how that works, but that would get unnecessarily complex. Since there are only 17 possible values for the number of wins a team can have in the playoffs (0-16), it seems like a better idea to switch to a machine learning decision tree approach to predict the number of wins a team will get in the playoffs.

To recap, looking at individual statistics is not good enough to predict a team's playoff performance. Also, we are no longer doing linear regression because we want to see how well we can predict the number of wins a team will get in the playoffs. The linear regressions and plots above mainly served as a starting point to see if there were any general trends.

Machine Learning

We will now be using machine learning to see if we can accurately predict the number of wins a team will get in the playoffs based on the many statistics collected for the regular season. This is more advanced than the previous section because we are now looking at a combination of factors instead of one and seeing how that relates to a team's playoff performance.

Predicting Number of Playoff Wins:

To start, we are going to use a random forest classifier and a decision tree classifier, with the number of playoff wins as the dependent variable and various regular season statistics as the independent variables. Here is a brief explanation of each statistic we are going to include as an independent variable. All of these come from the regular season, and we are trying to see whether they relate to playoff performance:

  1. AvAge: Team's average age
  2. PTS%: Team's points percentage
  3. PIM/G: Team's penalty minutes per game
  4. S%: Team's shooting percentage (how often they score)
  5. SV%: Team's save percentage
  6. SRS: Team's simple rating system. This is a statistic from hockey reference that rates teams based on their average goal differential and strength of schedule, expressed in goals above or below average (zero is average).
  7. SOS: Team's strength of schedule
  8. GF/G: Team's goals for per game
  9. GA/G: Team's goals against per game
  10. PP%: Team's power play percentage (how often they score when have a power play)
  11. PK%: Team's penalty kill percentage (how often they do not get scored on when on a penalty kill)
  12. WinPctLast10: Team's win percentage in last 10 games

Predicting Number of Wins Explanation:

In the code above, we used cross_val_score() on a decision tree classifier and a random forest classifier to compute the accuracy of our two models, where the dependent variable is the number of playoff wins and the independent variables are the regular season statistics listed above. I used two classifiers because I wanted to see if there was a big difference in accuracy between them. I was only able to use 5 folds because of the small sample size of the data. I also did not tune the hyperparameters for either model because I just wanted to get a general sense of the accuracy of each.
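The cross-validation described here could be sketched as follows. The data is synthetic random noise standing in for the 12 regular season features, so the printed accuracies say nothing about the real results; the function name and the standard-error calculation are my own choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, n_folds=5, seed=0):
    """Return {model name: (mean accuracy, standard error)} from k-fold CV."""
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(random_state=seed),
    }
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=n_folds, scoring="accuracy")
        std_error = scores.std(ddof=1) / np.sqrt(len(scores))
        results[name] = (scores.mean(), std_error)
    return results


# Synthetic stand-in data: 320 "teams", 12 features, win totals from 0 to 16.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(320, 12))
y_demo = rng.integers(0, 17, size=320)
demo_results = compare_classifiers(X_demo, y_demo)
print(demo_results)
```

With an integer cv and a classifier, scikit-learn stratifies the folds by class, which is why very rare win totals can trigger a warning on small data sets.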

Now, from these results we see that the decision tree classifier has an accuracy of around 10 percent and the random forest classifier has an accuracy of around 16 percent. Additionally, the low standard error for both classifiers shows that the accuracy varies little across the folds.

This accuracy is obviously very bad, and in general this model should not give you much confidence in predicting the number of wins a team will get in the playoffs. However, if we guessed the number of wins for each team uniformly at random, we would get an accuracy of around 6 percent (1 in 17). While this is an oversimplification because not every team can have the same number of wins, it serves the general purpose of showing that this model is doing some learning and that it is not completely random.
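The 6 percent figure comes from guessing one of 17 equally likely win totals, which is right 1/17 of the time (about 5.9 percent) whatever the true totals are. The short simulation below checks this; the helper name and the example win totals are made up for illustration.

```python
import random


def uniform_guess_accuracy(true_labels, n_classes=17, trials=2000, seed=0):
    """Estimate the accuracy of guessing each label uniformly at random."""
    rng = random.Random(seed)
    hits = 0
    total = 0
    for _ in range(trials):
        for label in true_labels:
            hits += rng.randrange(n_classes) == label
            total += 1
    return hits / total


# Hypothetical win totals for the 16 teams in one playoff bracket.
example_wins = [16, 14, 10, 9, 7, 6, 5, 4, 3, 3, 2, 2, 1, 1, 0, 0]
print("exact:", 1 / 17)
print("simulated:", uniform_guess_accuracy(example_wins))
```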

So, from this information we can conduct a T-test on our model's predictions vs a random prediction and see if the results are statistically significant to reject the null hypothesis.

Conducting T-Test on Number of Wins Model

Now, we are going to conduct a t-test of our random forest classifier model vs a random model for generating the number of wins per team. It should be noted that we will use the random forest classifier because it is better from an accuracy standpoint.
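One way such a comparison could be set up is sketched below: the per-fold accuracies of the model are compared against the accuracies of many uniformly random guessers using a two-sample t-test. This setup, the function name, and the example accuracies are assumptions for illustration and may differ from the test actually run in the notebook.

```python
import numpy as np
from scipy import stats


def compare_to_random(model_scores, y, n_classes, n_random_runs=30, seed=0):
    """Welch t-test of model accuracies against uniform random guessing."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    # Accuracy of a fresh uniform random guesser, repeated several times.
    random_scores = [
        np.mean(rng.integers(0, n_classes, size=len(y)) == y)
        for _ in range(n_random_runs)
    ]
    result = stats.ttest_ind(model_scores, random_scores, equal_var=False)
    return result.statistic, result.pvalue


# Hypothetical per-fold accuracies and synthetic labels, for illustration only.
fold_scores = [0.15, 0.17, 0.16, 0.14, 0.18]
labels = np.random.default_rng(1).integers(0, 17, size=300)
t_stat, p_value = compare_to_random(fold_scores, labels, n_classes=17)
print(t_stat, p_value)
```

Because the random guesses are redrawn on every call unless a seed is fixed, the p-value moves a little from run to run.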

T-Test Explanation

As we can see from running a t-test of our model vs a random one, we get a p-value of less than 0.05, which gives significant evidence to reject the null hypothesis. So, we can clearly say that our model is learning and that we have significant evidence to say hockey playoffs are not random.

It should be noted that because of the small dataset, the p-value changes from run to run. However, it never came close to 0.05, so this is not a cause for worry. In addition, I originally looked at running this test many times and then getting an "average p-value" using Fisher's method. However, I realized this was unnecessary because the p-value was always far smaller than 0.05.
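For reference, combining several p-values with Fisher's method is available in SciPy as combine_pvalues. The sketch below uses made-up p-values, and reading the "fisher test" mentioned above as Fisher's method is my assumption.

```python
from scipy import stats

# Hypothetical p-values from four repeated runs of the test.
p_values = [0.004, 0.010, 0.002, 0.007]

# Fisher's method: statistic = -2 * sum(log(p)), compared to a chi-squared
# distribution with 2 * len(p_values) degrees of freedom.
statistic, combined_p = stats.combine_pvalues(p_values, method="fisher")
print(statistic, combined_p)
```

Fisher's method assumes the tests are independent. Repeated runs on the same small dataset are not, so a combined value obtained this way would overstate the evidence.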

Are We Done?

From the above section, we had significant evidence to say that the NHL playoffs are indeed not random and that there are some outside factors that contribute to the number of wins a team will have in the playoffs. However, we also saw that our model had only around a 16 percent accuracy rate. As stated earlier, this should not inspire much confidence in our model's prediction capabilities. As of right now, I would say that if the playoffs are not completely random then they are basically random.

After thinking about this, a possible reason for the low accuracy is that we have 17 possible outcomes for each team. For example, we are currently treating a team with 1 win in the playoffs as completely different from a team with 2 wins. In practice, these teams are equivalent because neither made it out of the first round. So, my idea is to revise the question from predicting the number of wins a team will have to predicting which round a team will reach in the playoffs. This should make for better prediction capabilities and be more useful in general.

Recall that the playoffs currently work as a 16-team bracket, with each round being a best-of-7 series. If this is confusing, it may help to realize that they work the same way as March Madness from the Sweet 16 onward, except that each round is a best-of-7 series instead of a single game. Here is a breakdown of the round a team will be in based on its number of playoff wins.

  1. Round 1: 0-3 wins
  2. Round 2: 4-7 wins
  3. Round 3: 8-11 wins
  4. Round 4: 12-15 wins
  5. Winner: 16 wins

Note that there is not actually a round 5; I label the winner as round 5 so that the Stanley Cup winner has its own label.

Predicting Playoff Round:

Now, we can run the same code as in the previous section, but with the playoff round as the dependent variable instead of the number of playoff wins.

Predicting Playoff Round Explanation:

It looks like predicting the round instead of the number of wins gave better accuracy for each model. Previously, the decision tree classifier gave an accuracy of around 10 percent, and it now gives around 33 percent. Similarly, the random forest classifier went from around 16 percent accuracy to around 45 percent. This is overall great news. We also still see low standard error for both, which shows that the accuracy varies little across folds.

Additionally, it is important to note that I did not tune the hyperparameters for the classifiers because I wanted to give a general sense of their accuracy. When testing, I played around with the min_samples_leaf parameter and saw an accuracy of around 50 percent with min_samples_leaf set to 30 or more. However, this also made the model predict every team to be in either round 1 or round 2. That every team can get the same round prediction is a major flaw in the project, but this will be explained more later.
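The effect of a large min_samples_leaf can be illustrated on synthetic data. In this sketch the features are pure noise and the labels follow the 8/4/2/1/1 split of rounds over 20 hypothetical seasons, so it only shows the mechanism and does not reproduce the project's numbers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# 20 hypothetical seasons, each with 8, 4, 2, 1, 1 teams in rounds 1 to 5.
y = np.tile(np.repeat([1, 2, 3, 4, 5], [8, 4, 2, 1, 1]), 20)
X = rng.normal(size=(len(y), 12))  # noise features standing in for team stats


def predicted_rounds(min_samples_leaf):
    """Fit a tree and list the distinct rounds it predicts on its training data."""
    tree = DecisionTreeClassifier(min_samples_leaf=min_samples_leaf, random_state=0)
    tree.fit(X, y)
    return sorted(set(tree.predict(X).tolist()))


default_rounds = predicted_rounds(1)
large_leaf_rounds = predicted_rounds(30)
print("min_samples_leaf=1: ", default_rounds)
print("min_samples_leaf=30:", large_leaf_rounds)
```

Each leaf predicts its most common label, so when every leaf must hold at least 30 teams, rounds with only a handful of examples can rarely win a leaf and tend to disappear from the predictions.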

Overall, I think it should be noted that these accuracies are still not good. They are clearly better than random, but I do not think these results should inspire much confidence. Regardless, I will conduct a t-test of the new model against a random model to see if this model is significantly better than random.

Conducting T-Test on Playoff Round Model

Now, we are going to conduct a t-test of our random forest classifier model vs a random model for generating the playoff round per team. It should be noted that we will use the random forest classifier because it is better from an accuracy standpoint.

T-Test Explanation

As we can see from running a t-test of our model vs a random one, we get a p-value of less than 0.05, which gives significant evidence to reject the null hypothesis. So, we can clearly say that our model is learning and that we have significant evidence to say hockey playoffs are not random. This is the same thing we saw in the previous test.

Predicting Winner of 2022 Season

Before giving the conclusion to this project, I thought it would be fun to use this model and see if we can predict the winner of the 2022 playoffs, which are currently happening. Unfortunately, the Capitals are out of the playoffs so the playoffs are no longer fun to watch, but the prediction could still be interesting.

Also, it is important to note that the prediction we are about to make should not be taken very seriously, because the model has no way of knowing that only a certain number of teams can be in each round. Specifically, at the end of the playoffs there will be 8 teams that ended up in round 1, 4 teams in round 2, 2 teams in round 3, 1 team in round 4 (second place), and 1 team in round 5 (the winner). I tried to investigate creating a bracket-style classifier, but this was very complex and out of the scope of this project/class. To work around this limitation, I will run the prediction many times and see which round each team gets placed in most often.

I will also be using the decision tree classifier because the random forest classifier could not predict a team being in a round over 3. This makes sense because in every season 14 of the 16 teams are placed in the first 3 rounds, which is about 88 percent of the data. As stated before, the classifiers have no way of knowing the bracket structure, which makes it hard for them to predict the higher rounds when generalizing from the data. The decision tree classifier gave worse accuracy but did predict rounds above 3, so I am going to use it. Again, this is just for fun and not really the point of the project. The purpose of the project was to see if the playoffs are random, which we already saw is not completely true.
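The idea of running the prediction many times and keeping each team's most frequent round could be sketched as below. Refitting the tree on a bootstrap resample in every run is my assumption about where the run-to-run variation comes from, and the function name and toy data are hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.tree import DecisionTreeClassifier


def most_common_rounds(X_train, y_train, X_new, team_names, n_runs=100, seed=0):
    """Refit a decision tree many times and return each team's modal round."""
    rng = np.random.default_rng(seed)
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    tallies = {team: Counter() for team in team_names}
    for run in range(n_runs):
        # Bootstrap resample so every run trains on slightly different data.
        idx = rng.integers(0, len(y_train), size=len(y_train))
        tree = DecisionTreeClassifier(random_state=run)
        tree.fit(X_train[idx], y_train[idx])
        for team, predicted in zip(team_names, tree.predict(X_new)):
            tallies[team][int(predicted)] += 1
    return {team: counts.most_common(1)[0][0] for team, counts in tallies.items()}


# Toy data in which one feature fully determines the round.
X_toy = [[0.0]] * 20 + [[1.0]] * 20
y_toy = [1] * 20 + [2] * 20
toy_prediction = most_common_rounds(X_toy, y_toy, [[0.0], [1.0]], ["Team A", "Team B"])
print(toy_prediction)
```

Taking the mode per team still ignores the bracket structure, so several teams can share the same top round.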

Prediction Results

My model is saying that the Nashville Predators are the team that will win the Stanley Cup this year. Unfortunately, this is not possible since they just lost in the first round 4 games to 0. We can also see that the Panthers, Wild, and Lightning have the next best odds to win. From these teams, I think the Panthers will win because they unfortunately beat the Capitals in the first round. Interestingly, my model predicted the Capitals and Kings to be first round exits, which sadly just happened in real life.

Again, it should be noted that the model does not know the playoff structure, which is why it says multiple teams can end up in the fourth round when only 1 can (the runner-up). Figuring out this bracket structure is out of the scope of this project. Additionally, the model weighted every category equally, so something like penalty minutes per game was weighted the same as points percentage. In reality, points percentage is probably more important to a team's success in the playoffs than penalty minutes per game. However, figuring out how to weight individual regular season statistics is very complex and out of the scope of this project. This may also explain why the model said that the Nashville Predators would be Stanley Cup champions this season, as they had the most penalty minutes per game. This could point to a possible trend in previous years that teams with a lot of penalties perform well in the playoffs. Again, looking at this is out of the scope of the project, but it could be interesting to explore if I decided to do more research into this topic.

Conclusion

The main thing we saw from all this data analysis is that we had significant evidence to reject the null hypothesis that the NHL playoffs are random in favor of the alternative hypothesis that some factors help predict how well a team will do in the playoffs. However, we also saw that our best model was only able to predict the round a team ends up in around 45 percent of the time. This is not very good, and it is not a model that should be used for betting purposes or for fun.

Additionally, my results may not be valid because the models were not able to recognize the NHL bracket structure and assumed every regular season statistic had the same weight. For example, they assumed that the average age of a team and points percentage had the same weight. However, accounting for these things added complexities that were out of the scope of the project. Additionally, when I did the t-tests, I gave equal likelihood to each number of wins and each round. This was done on a recommendation from the course staff but is not truly accurate, because at the end of the playoffs there will only be 8 teams that end up in round 1, 4 in round 2, 2 in round 3, 1 in round 4 (second place), and 1 in round 5 (the winner). So, this could mean that my results are not statistically significant. Again, this is a bit out of the scope of this project, but it should still be noted that my results should be taken with a grain of salt.
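To see how much the choice of baseline matters, the short calculation below works out the expected accuracy of three simple guessing strategies from the 8/4/2/1/1 round split described above. It is a worked example under that split only, not part of the original analysis.

```python
from fractions import Fraction

# Teams finishing in each round of a 16-team bracket (round 5 = winner).
round_counts = {1: 8, 2: 4, 3: 2, 4: 1, 5: 1}
total = sum(round_counts.values())
shares = {rnd: Fraction(n, total) for rnd, n in round_counts.items()}

# Guess each of the 5 rounds with equal probability.
uniform_accuracy = sum(share * Fraction(1, len(shares)) for share in shares.values())
# Guess each round as often as it really occurs.
proportional_accuracy = sum(share ** 2 for share in shares.values())
# Always guess the most common round (round 1).
majority_accuracy = max(shares.values())

print(float(uniform_accuracy), float(proportional_accuracy), float(majority_accuracy))
# prints: 0.2 0.3359375 0.5
```

Always guessing round 1 is right for half of all playoff teams, so it is a much tougher benchmark than the 20 percent from uniform guessing, which is one concrete reason to take the significance results with a grain of salt.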

My viewpoint is that it appears that hockey playoffs are not completely random but are basically random and anything can happen. I also think that something being completely random vs basically random is not a big difference for sports fans. However, the NHL being unpredictable is not a bad thing. The NHL playoffs and sports in general are so great to watch because anything can happen. If we were able to accurately predict the winner every time, there would be zero point for the athletes to play the game. Randomness is what makes sports exciting.

Overall, this project was very fun to do and gave a lot of insight into data science and the randomness of the NHL. In the future, I can look to make a model that considers the NHL bracket structure and look at additional possible factors that may impact a team's performance in the playoffs. This tutorial mostly served as a starting point that can be explored further with additional research. Thank you for reading, and I hope this tutorial gave you a bit of insight into the randomness of the NHL and maybe some data science techniques.

References

https://www.hockey-reference.com/

https://www.nbcsports.com/washington/capitals/why-are-stanley-cup-playoffs-so-much-different-regular-season

https://fivethirtyeight.com/features/apparently-the-regular-season-is-irrelevant-in-the-nhl/