Group 317-5: Stephen Ling, Lewis Clay Ballard, Hongwei Tian, Chester Zhang
Introduction
Hockey is a sport in which two teams play against each other by trying to maneuver a ball or a puck into the opponent’s goal using a hockey stick. We chose to focus on hockey because we were all interested in sports and it is easy to obtain a comprehensive data frame of different sports leagues. In this project, we plan to answer the questions: is there a trend in the birthdays of NHL players? Will taking more shots in a competition increase the goal percentage? What is the relationship between penalty minutes of NHL players and their age? What are the body features (Weight, Height, BMI) of NHL players in different positions? Does the experience in the league relate to the goal percentage of NHL players?
Based on our knowledge of hockey before analyzing the data, we expected the following: more NHL players have birthdays close to January than later in the year; taking more shots in a game is associated with goal percentage; younger players tend to receive more penalty minutes; players at different positions have distinctive body features; and experience in the league is associated with the goal percentage of NHL players.
In general, there are significant trends in the features (birthday, BMI, height, weight) of NHL players, and some factors have a strong correlation with NHL players’ performances, but some factors do not.
Background
Data
The data set was collected by a hockey fan and published on Kaggle. Its major sources are various hockey statistics websites. The data frame contains 40 columns (variables) and 27319 rows, covering 3,340 players in the National Hockey League (NHL) from 1976 to 2020. We use 15 variables from the data set:
- Name, name of the player.
- Date_of_birth, the player’s birthday.
- Goals, number of goals the player scored in each season year.
- Assists, number of assists the player made in each season year.
- Points, points = goals + assists, which measures a player’s general performance in each season year.
- Penalty_Minutes, the total minutes a player spends in the penalty box.
- Shots_on_Goal, the number of shots the player takes in each season year.
- Shooting_Percentage, shooting percentage = goals / shots on goal, which measures the goal ratio of the player in each season year.
- Position, the position the player plays in each season year, including Center, Defense, Forward, Goaltender, Left Wing, Right Wing.
- Height, the measured height of the player in each season year.
- Weigth, the measured weight of the player in each season year.
- Body_Mass_Index, the measured BMI of the player in each season year.
- Age, the age of the player in each season year.
- Experience, the year(s) the player has been in the NHL.
- Time_on_Ice_per_Game, the time (in the form “minute:second”) that the player is on the ice per game in each season year.
Source of Data
Background Information
To make our analysis easier to follow, we first explain some background terms.
- The National Hockey League (NHL) is an organization of professional ice hockey teams in North America, formed in 1917. The NHL became the strongest league in North America in 1926.
- Body Mass Index (BMI) is a person’s weight in kilograms divided by the square of height in meters. A high BMI can be an indicator of high body fatness.
- General Rules of Hockey: Hockey players can only hit the puck with their stick. A goal can only be scored either from a field goal, a powerplay (caused by an opponent’s penalty), or from a penalty shot. Hockey players may not trip, push, charge, interfere with, or excessively physically handle an opponent in any way.
- A penalty minute is a punishment in hockey for an infringement of the rules. A player cannot participate in the match for a certain amount of time, depending on the severity of the infraction, and most penalty minutes are caused by physical conflict during the match.
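The BMI definition given above can be written as a one-line calculation. The sketch below is only an illustration: the function name `body_mass_index` and the sample weight and height are assumptions, not values taken from the data set.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = weight in kilograms divided by the square of height in meters."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical player: 90 kg and 1.85 m tall.
example_bmi = body_mass_index(90, 1.85)
print(round(example_bmi, 1))
```

A player of roughly this size lands slightly above 25, the threshold the CDC uses for "overweight", even though the extra mass may be muscle rather than fat.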
Unusual Influencing Factors
- There are missing values in the data, and even though we drop missing values during analysis, this may still influence our interpretation of results.
- Some players play very little or not at all during a season year, which is an unusual influencing factor: these outliers may affect the distribution of data, which may affect our interpretation of results.
- The team environment is also an unusual influencing factor: a vigorous team may help players unlock their potential, and a passive team may affect players’ performance negatively.
- Finally, some outliers in scoring goals, weight, height, and BMI will also affect the distribution of the whole data set, which may affect our regression analysis and interpretation of body features of NHL players.
Focuses
- We have two general focuses in our analysis of the data. The first focus is the features of NHL players, including birthdays and body features (weight, height, BMI). The second focus is exploring factors that influence NHL players’ performance, including shooting rate vs. goal percentage and experience of NHL players vs. goal percentage.
Analysis
Trend in Birthdays of NHL Players
- We were inspired to explore the birthdays of NHL players because a statistics book titled Outliers mentions that Canadian hockey players’ birthdays are greatly affected by the age cutoff used in hockey player selection. So, we first transform the column Date_of_birth from character to date format. Then, we group the data frame by players’ names and birthdays and summarize the count of birthdays. Finally, we create a histogram of the month of players’ birthdays.
- From the histogram, January has the highest count, with over 350 players born in that month, and the counts decrease markedly from January to August. So, it is reasonable to conclude that there is a trend in the birthdays of NHL players: the number of players increases as birth months get closer to January.
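The counting step described above (one count per player, then a tally by birth month) can be sketched as follows. This is a minimal illustration in Python, not the code used for the report: the sample rows, the function name `birth_month_counts`, and the `%Y-%m-%d` date format are all assumptions.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows: the data has one row per player per season,
# so the same player can appear several times.
rows = [
    {"Name": "Player A", "Date_of_birth": "1990-01-15"},
    {"Name": "Player A", "Date_of_birth": "1990-01-15"},
    {"Name": "Player B", "Date_of_birth": "1988-01-03"},
    {"Name": "Player C", "Date_of_birth": "1992-08-21"},
]


def birth_month_counts(rows, date_format="%Y-%m-%d"):
    """Count players (not player-seasons) by birth month (1 = January)."""
    # Deduplicate on (name, birthday) so each player is counted once.
    players = {(r["Name"], r["Date_of_birth"]) for r in rows}
    months = (datetime.strptime(dob, date_format).month for _, dob in players)
    return Counter(months)


counts = birth_month_counts(rows)
for month in sorted(counts):
    print(month, counts[month])
```

Deduplicating before counting matters here: without it, players with long careers would be counted once per season and would distort the month totals.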
Body Features of NHL Players
- For the body features of NHL players, we decided to explore the height, weight, and BMI in six positions: Center, Defense, Forward, Goaltender, Left-Wing, and Right-Wing. We first group the data by position. Then, we create boxplots of body features faceted by players’ positions.
- Over 75% of NHL players’ heights are over 180 cm, which is almost 5 foot 11 inches (2 inches above the average height for a US male). The median heights of Center, Forward, Goaltender, Left Wing, and Right Wing players are similar and close to 185 cm. The median height of Defense players is about the same as the upper quartile (the 75th percentile of heights) of other positions. This indicates that the height distribution of Defense players is shifted upward, so they are more likely to be taller than players at other positions.
- The weights of most NHL players are similar, with the medians for all positions being in the range of 85-95 kg or 187-210 lbs, which is in the range of the average male weight (197 lbs). While there isn’t too much variation by position, the median weights for Center, Forward, and Goaltender players are around the 1st quartile (the 25th percentile of weights) of the rest of the positions’ (Defense, Left-Wing, Right-Wing) weight distributions, showing that players at Defense, Left-Wing, and Right-Wing are more likely to weigh more than players at other positions.
- We do not find any significant features through comparing the BMI of each position. Each position’s median BMI is at or above 25, which is defined as overweight by the CDC. Interestingly, the overall variation in each position’s BMI is smaller (the quartiles are closer together in value) than in other body measurements (where height and weight varied over 5-10 units, BMI varies over 2-5 units). This may be due to how the BMI is calculated, or due to the workouts and diets players follow to optimize their physical condition.
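The quartile comparisons above are what the boxplots show graphically. As a rough sketch of the underlying calculation, the snippet below groups values by position and computes the three quartiles for each group. The heights are made up for illustration, and `quartiles_by_position` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (position, height in cm) pairs, not values from the data set.
rows = [
    ("Defense", 184), ("Defense", 186), ("Defense", 188),
    ("Defense", 190), ("Defense", 192),
    ("Center", 180), ("Center", 182), ("Center", 184),
    ("Center", 188), ("Center", 190),
]


def quartiles_by_position(rows):
    """Return {position: [Q1, median, Q3]} for the measured values."""
    groups = defaultdict(list)
    for position, value in rows:
        groups[position].append(value)
    # method="inclusive" treats the data as the whole population of interest.
    return {
        position: quantiles(values, n=4, method="inclusive")
        for position, values in groups.items()
    }


q = quartiles_by_position(rows)
print(q["Defense"])  # Q1, median, Q3 for Defense
print(q["Center"])   # Q1, median, Q3 for Center
```

In this toy data the Defense median equals the Center upper quartile, which is the same pattern the report describes for heights.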
Factors Influencing Goal Percentage
- To explore the relationship between shooting rate (shots per second) and goal percentage, we first filter out the players who never took a shot or never played in a game, so that they do not distort the data. Then, we group the data by players’ names and summarize the number of shots they take in each season year, the goal percentage in each season year, and the time on the ice in each season year. Then, we calculate their shooting rate by dividing the number of shots in each season year by the time on the ice. Finally, we use a scatter plot with a linear regression line to show the relationship between shooting rate (shots made per second) and goal percentage.
From the scatter plot and regression line, we find shooting rate may not have a significant effect on goal percentage because the points in the scatter plot do not have an obvious linear trend; also, the absolute value of the slope of the regression line is quite small. Moreover, it seems that most shooting rates are around 0.0 shots/s to 0.2 shots/s, and most goal percentages are less than 25%. Besides, most players score goals in the range of 0-200, and only a few players score goals over 400.
To have a better understanding of the relationship between the shooting rate (shots made per second) and goal percentage, we calculate the correlation coefficient to quantify the strength of a linear relationship between shooting rate and goal percentage. For this, we extract the x and y values from the previous data frame and calculate the correlation coefficient with the given formula.
The r value is 0.2410339, which indicates a weak positive correlation between the shooting rate (shots made per second) and goal percentage. So, it is reasonable to conclude that we do not find a strong linear relationship between shooting rate (shots made per second) and goal percentage.
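The two calculations in this section, shooting rate and the correlation coefficient, can be sketched as below. The report does not show its formula, so this uses the standard Pearson definition as an assumption. The records are hypothetical, the helper names `time_to_seconds` and `pearson_r` are illustrative, and the time strings are assumed to hold a total ice time in the “minute:second” form.

```python
from math import sqrt


def time_to_seconds(mmss: str) -> int:
    """Convert a 'minute:second' string such as '15:30' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def pearson_r(xs, ys):
    """Pearson correlation: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical records: (shots, time on ice, goal percentage).
records = [
    (120, "900:00", 10.0),
    (200, "1200:00", 12.5),
    (60, "700:00", 8.0),
]

# Shooting rate = shots divided by time on ice in seconds.
rates = [shots / time_to_seconds(toi) for shots, toi, _ in records]
goal_pcts = [pct for _, _, pct in records]
r = pearson_r(rates, goal_pcts)
print(r)
```

Values of r near 0 indicate little linear association, and values near 1 or -1 indicate a strong one, which is why 0.2410339 is read as weak.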
Penalty Minutes
- To explore the relationship between the age of NHL players and penalty minutes, we first group the data by players’ ages (we do not group by players’ names because players’ ages change across season years). Then, we create a scatter plot to see the distribution of penalty minutes of players at different ages.
It is clear that players in their late 20s (around 27 years old) record the highest penalty minutes, and the highest penalty minutes decrease as age rises above 27.
This scatter plot mainly reveals the extreme values of penalty minutes at each age and does not describe the typical player at that age. So, we sum up the penalty minutes at each age and calculate the mean. Then, we create a scatter plot of the mean penalty minutes of players at different ages.
- From this graph, we find that NHL players at 27 years old have the highest mean penalty minutes. Moreover, the mean penalty minutes decrease as players’ ages move above or below 27 years old. There are also some outliers around 45 years old.
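The step from raw penalty minutes to a mean per age can be sketched as follows. The rows are hypothetical (age, penalty minutes) pairs for individual player-seasons, and `mean_penalty_minutes_by_age` is an illustrative name rather than a function from the report.

```python
from collections import defaultdict

# Hypothetical (age, penalty minutes) pairs, one per player-season.
rows = [(27, 100), (27, 50), (35, 20), (22, 30), (22, 40)]


def mean_penalty_minutes_by_age(rows):
    """Return {age: mean penalty minutes}, with ages in ascending order."""
    totals = defaultdict(lambda: [0, 0])  # age -> [sum of minutes, row count]
    for age, minutes in rows:
        totals[age][0] += minutes
        totals[age][1] += 1
    return {age: total / count for age, (total, count) in sorted(totals.items())}


means = mean_penalty_minutes_by_age(rows)
for age, mean_minutes in means.items():
    print(age, mean_minutes)
```

Ages with very few player-seasons, such as those around 45, produce means based on a handful of rows, which is one reason they can appear as outliers.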
Hypothesis Test
In order to find the difference between goal percentage of experienced players and goal percentage of not experienced players, we first design a hypothesis test to see if the difference exists.
B: the experiment has binary outcomes, which are “score a goal” and “not score a goal.”
I: assumed that all trials are independent
N: the sample size is fixed
S: assume each trial has the same probability of success
\(p_{\text{experienced}}\) is the probability that an experienced player scores a goal on a shot.
\(p_{\text{not experienced}}\) is the probability that a non-experienced player scores a goal on a shot.
\(X_1\) is the number of goals experienced players score.
\(X_2\) is the number of goals non-experienced players score.
The statistical model for \(X_1\) is \(X_1 \mid p_{\text{experienced}} \sim \text{Binomial}(1175564, p_{\text{experienced}})\)
The statistical model for \(X_2\) is \(X_2 \mid p_{\text{not experienced}} \sim \text{Binomial}(733724, p_{\text{not experienced}})\)
\[
\begin{aligned}
H_0 &: p_{\text{experienced}} = p_{\text{not experienced}} \\
H_a &: p_{\text{experienced}} \neq p_{\text{not experienced}}
\end{aligned}
\]
Statistics Values of Hypothesis Test
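Difference in sample proportions (experienced minus not experienced) | -0.0035607 |
Total goals (both groups) | 192282 |
Shots by experienced players | 1175564 |
Shots by non-experienced players | 733724 |
Total shots (both groups) | 1909288 |
Pooled proportion | 0.1007087 |
Standard error of the difference | 0.0004494 |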
- Based on these values, we find that the p-value is 2.301138e-15, which is small enough to reject \(H_0: {p}_{\text{experienced}} = {p}_{\text{not experienced}}\). So, it is reasonable to conclude that there is a difference between the goal percentage of experienced players and the goal percentage of non-experienced players.
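The test statistics can be reproduced from the goal and shot counts reported in this document. The sketch below is one way to do it, under two assumptions: that the larger group (1175564 shots, 116781 goals) is the experienced one, and that the standard error is the unpooled form, which is the version that matches the reported 0.0004494. A pooled standard error is a common alternative and gives a very similar result here. The function name is illustrative.

```python
from math import erfc, sqrt


def two_proportion_z_test(goals_1, shots_1, goals_2, shots_2):
    """Two-sided z-test for a difference between two proportions."""
    p_hat_1 = goals_1 / shots_1
    p_hat_2 = goals_2 / shots_2
    diff = p_hat_1 - p_hat_2
    # Unpooled standard error (assumed; it matches the reported value).
    se = sqrt(p_hat_1 * (1 - p_hat_1) / shots_1
              + p_hat_2 * (1 - p_hat_2) / shots_2)
    z = diff / se
    # Two-sided p-value under a standard normal: 2 * P(Z > |z|).
    p_value = erfc(abs(z) / sqrt(2))
    return diff, se, z, p_value


# Group 1: experienced players; group 2: non-experienced players.
diff, se, z, p_value = two_proportion_z_test(116781, 1175564, 75501, 733724)
print(diff, se, z, p_value)
```

The resulting z score is close to -7.9, far out in the tail of the standard normal distribution, which is why the p-value is so small.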
Confidence Interval
From the previous test, we find that there is a difference between the goal percentage of experienced players and the goal percentage of non-experienced players. However, is the goal probability of experienced players higher or lower than that of non-experienced players? We decided to use a confidence interval to estimate the size and direction of the difference between the two groups.
B: the experiment has binary outcomes, which are “scores a goal” and “doesn’t score a goal.”
I: assumed that all trials are independent
N: the sample size is fixed
S: assumed each trial has the same probability of success
\(p_1\) is the probability that an experienced player scores a goal on a shot.
\(p_2\) is the probability that a non-experienced player scores a goal on a shot.
\(X_1\) is the number of goals experienced players score.
\(X_2\) is the number of goals non-experienced players score.
The statistical model for \(X_1\) is \(X_1 \mid p_1 \sim \text{Binomial}(1175564, p_1)\)
The statistical model for \(X_2\) is \(X_2 \mid p_2 \sim \text{Binomial}(733724, p_2)\)
\[
\text{SE}(\hat{p}_1 - \hat{p}_2) =
\sqrt{ \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2} }
\]
Statistics Values for Confidence Interval
Less Experienced Players | 11986 | 733724 | 75501 |
More Experienced Players | 12816 | 1175564 | 116781 |
- Based on the data and formulas, we calculate the 95% confidence interval. We are 95% confident that the goal probability for experienced players is between 0.44 and 0.27 percentage points lower than that for non-experienced players. This result suggests that experienced players’ goal percentages tend to be slightly lower than non-experienced players’ goal percentages.
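The interval can be recomputed from the counts in this document as the difference in sample proportions plus or minus a normal critical value times the standard error. The sketch below assumes that group 1 is the experienced group (1175564 shots, 116781 goals) and that the sample proportions are plugged into the standard error formula. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(goals_1, shots_1, goals_2, shots_2, level=0.95):
    """Normal-approximation confidence interval for p_1 - p_2."""
    p_hat_1 = goals_1 / shots_1
    p_hat_2 = goals_2 / shots_2
    diff = p_hat_1 - p_hat_2
    se = sqrt(p_hat_1 * (1 - p_hat_1) / shots_1
              + p_hat_2 * (1 - p_hat_2) / shots_2)
    # Critical value, about 1.96 for a 95% interval.
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z_star * se, diff + z_star * se


# Group 1: experienced players; group 2: non-experienced players.
lower, upper = diff_confidence_interval(116781, 1175564, 75501, 733724)
print(round(lower * 100, 2), round(upper * 100, 2))  # in percentage points
```

Because both endpoints are negative, the interval excludes zero, which agrees with the rejection of the null hypothesis in the test above.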
Discussion
Trend in Birthdays
- According to the NHL, “all players who will be 18 years old on or before September 15 and not older than 19 years old before December 31 of the draft year are eligible for selection for that year’s NHL Entry Draft.” As a result, within a birth year, the players born in January are the oldest at selection time. Below age 27, players’ physical condition generally improves with age, so players born in January tend to be in the “best” physical condition among their peers. This advantage is strongest in the younger leagues, such as high school and junior leagues, and it makes players born closer to January more likely to be selected as NHL players.
- Potential Short-comings: Our analysis can hardly explain the distribution of players born in October, November, and December, so there may be other influencing factors, like the financial condition of players or rules in younger leagues, that we have not included in our analysis.
- In the future, we can use statistical methods to find other possible factors to explain the distribution of players’ birthday months. Moreover, we may group players by their positions to see the body features of players in different positions based on their birth months.
Body Features of NHL Players
- First, we find the median height of defense players is about the same as the upper quartile (the 75th percentile of heights) of other positions, which indicates that the height distribution of defense players is shifted upward and they are therefore more likely to be taller than players at other positions. Based on our research, we believe players with larger body sizes (heights) have greater advantages in body conflict and in stopping opponents’ attacks, which helps explain the height distribution of players at different positions.
- Next, there isn’t too much variation by position in weight. The median weights for Center, Forward, and Goaltender players are around the 1st quartile (the 25th percentile of weights) of the rest of the positions’ (Defense, Left-Wing, Right-Wing) weight distributions, showing that players at Center, Forward, and Goaltender are more likely to weigh less than players at other positions. Based on our research, we believe players with lower body weight can move faster, which helps explain the weight distribution we observe (Center and Forward players need to conduct “quick attacks”, and Goaltenders need to respond fast enough to intercept shots).
- Finally, we do not find any significant features through comparing the BMI of each position. Each position’s median BMI is at or above 25, which is defined as overweight by the CDC. Based on our research, we believe hockey is a sport involving many physical conflicts, so greater body “density” can be an advantage in the competition, which helps explain the BMI distribution we get.
- Potential Short-comings: we only provided a very general analysis of the body features of NHL players instead of looking into details.
- In the future, we may collect body feature data frames of players in other sports like tennis, basketball, American football, etc., explore the body features of athletes in different sports, and look for possible explanations.
- Different Methods: we can also use histograms to reflect the distribution of body features of NHL players at different positions.
Goal Percentage vs. Shooting Rate (Shots Taken per Second)
- Based on our scatter plot and correlation coefficient, shooting rate is at most a weak influence on goal percentage: the correlation coefficient of 0.2410339 suggests a weak positive linear relationship between the two. So, we conclude that we do not find a strong linear relationship between shooting rate (shots made per second) and goal percentage.
- Potential Short-comings: some players only have one shot recorded but successfully score (giving a 100% goal percentage). These “lucky” goals are outliers in our data frame but have a significant effect on our results, which becomes a potential shortcoming of our analysis.
- In the future, we may explore other factors that may influence the goal percentage in order to identify ways to increase it.
Penalty Minute vs. Players’ Age
- As mentioned in the background information, penalty minutes are mainly caused by physical conflict during the match. Younger people tend to be stronger, so we expected players to receive more penalty minutes the younger they are. However, our analysis shows that penalty minutes peak when players are close to 27 years old and decrease markedly as players are younger or older than 27. Based on scientific research on athletes, most athletes reach their best physical condition around 27. So, NHL players around age 27 are in better physical condition, which makes them more likely to engage in physical conflict when attacking or defending, and physical conflict leads to high penalty minutes.
- Potential Short-comings: there are other factors, like body features and temper, that may influence the penalty minutes of NHL players in each season year. Stronger players may be assigned by the coach to engage in physical conflict, and players with a bad temper also tend to engage in physical conflict, which may lead to high penalty minutes.
- In the future, we may further explore other factors that may influence the penalty minutes of NHL players.
- Different Methods: It is also possible for us to do linear regression analysis and make residual plots to address this problem.
Hypothesis Test
- From our hypothesis test, the p-value is 2.301138e-15, which is small enough to reject \(H_0: {p}_{\text{experienced}} = {p}_{\text{not experienced}}\). So, it is reasonable to conclude that there is a difference between the goal percentage of experienced players and the goal percentage of non-experienced players.
- Potential Short-comings: we assume each shot of NHL players has the same probability of success (scoring a goal). However, in reality, each shot of NHL players during the competition may not have the same probability of success, which is actually influenced by many factors. Moreover, some “lucky” goals (only one shot but score successfully, 100% goal percentage) will also influence the value of p_pool, which also affects standard error, z score, and p_value.
- Our hypothesis test has limitations even though it leads to a conclusion. In the future, we could look for statistical models that better fit the data when designing the hypothesis test.
Confidence Interval
- The 95% confidence interval for the difference shows the goal probability for experienced players being between 0.44 and 0.27 percentage points lower than that for non-experienced players. This result is surprising because one would commonly expect more experienced players to have a higher goal percentage than non-experienced players. However, the 95% confidence interval we obtain suggests experienced players have a lower goal percentage than non-experienced players. So, we did some internet research and found some possible factors that could explain our result: the positions of experienced players (experienced players may be too “old” and no longer strong enough to play attacking positions), experienced players’ training being focused on defense and assists rather than shooting, etc.
- Potential Short-comings: we again assume each shot of NHL players has the same probability of success (scoring a goal). However, in reality, each shot of NHL players during the competition may not have the same probability of success, which is actually influenced by many factors.
- Different Methods: we could also use a one-sided hypothesis test with Ho: p1 >= p2, Ha: p1 < p2 for this problem. If we can reject Ho, then we reach the same conclusion.
New Data
- The current data set is large enough for our analysis. We can look into more of its variables to refine our analysis and understanding.
- In the future, we may combine other data frames like “rank of teams in NHL” to help us have a better understanding of the team’s influence on players’ performance.
References
- The Data Frame We Used.
- Some Coding Ideas We Used.
- An Introduction and Background Information for Hockey Knowledge.
- The History of the NHL.
- The Definition and Meaning of BMI.
- The Rules of Hockey.
- The NHL Selection Rules.
- The Effect of Age on Players