The Probability Peers prospect model (PPer) for NHL Drafting

By: Connor Jung

Introduction: Why drafting is so important

The introduction of the salary cap in the 2005-2006 season ushered in the era of Choice Theory for the NHL. A principle of Microeconomics, Choice Theory looks at choice optimization given a set of budget constraints. That’s just a fancy way of saying that NHL organizations have to strike a difficult balance: build championship teams within the salary constraints imposed by the CBA.

This has made cost-effective asset management a priority across the league. In effect, it’s possible cap space has crowded out unrestricted free agents (a topic I would love to explore at a later date), because the prices that they typically command in free agency make them luxuries rather than essentials. Likewise, exploring arbitrage opportunities in the trade market means you have to give something to get something, and you’re often trading for a player whose organization gave up on them (sorry, Goldy). Instead, GMs have to focus on making sizeable gains at the NHL Entry Draft. With cost-effective entry-level contracts (ELCs) so valuable in the cap era, it has become more important than ever to draft well.

Measuring Prospect Success

I’ve spent some time as a scout at the ISS (albeit at the Midget level) and something you hear constantly from rink to rink is “Oh, he’s going to be a player.” In my opinion, that loosely translates to “smart player who will find ways to play at the next level no matter their scoring role.”

But how do we identify those players and / or that skill?

Determining a definition of “prospect success” is important, because that’s what you’re ultimately trying to model. There has been research done on finding the value (or opportunity cost) of picks, so NHL GMs are constantly thinking about risk/reward when it comes to selecting players. One measure of prospect success in prior work looks at players who have played 200 games in the NHL, irrespective of scoring role, as a benchmark for an NHL player. Our goal, therefore, is to predict whether a player is “going to be a player” and eclipse 200 NHL games played.

Prior Work

The methodology I chose is inspired by the pGPS model (an extension of the now-defunct PCS model), which looks at the scoring of junior-level prospects across leagues and attempts to group them into cohorts.

The main components of pGPS aim to look at players who are alike, based on league, height, weight and adjusted points per game, and then observe how successful that cohort is at producing NHL players. A probability is assigned to a player in the cohort by dividing the total successful NHL players of the cohort by the total number of matches. The cohorts are limited by the sample of players that appear in the same league as each other. I would like to open the peer group to all similar players regardless of league.
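To make the cohort arithmetic concrete, here is a minimal Python sketch of the calculation described above. The function name, the games-played list, and the use of the 200-game benchmark as the success rule are assumptions for illustration, not pGPS’s actual code.

```python
def cohort_probability(cohort_games_played, threshold=200):
    """Share of a cohort's matches that reached the NHL games-played benchmark."""
    if not cohort_games_played:
        return None  # no matches, so no probability can be assigned
    # Assumption: "success" means at least `threshold` NHL games played.
    successes = sum(1 for gp in cohort_games_played if gp >= threshold)
    return successes / len(cohort_games_played)

# Hypothetical NHL games played for eight matched players (made-up numbers).
example_cohort = [512, 0, 34, 201, 0, 0, 890, 12]
cohort_prob = cohort_probability(example_cohort)
print(cohort_prob)  # 3 successes out of 8 matches -> 0.375
```

The smaller the cohort, the more a single match moves the result, which is why restricting matches to one league can make these probabilities noisy.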

Introducing “Probability Peers” AKA “PPer’s”

For the purposes of this study, I’m introducing a term I call “Probability Peers”. Because I feel any new ‘analytic’ in 2019 requires an acronym for its name, we’ll go with “PPer” for short. While the PPer prospect model was influenced by pGPS and its predecessors, the model diverges in a few ways.

PPer is a regression-based approach. Instead of using points per game to predict NHL success, I use the following features:

  • Age

  • League

  • Era Adjusted Goals / Game

  • Era Adjusted Assists / Game

  • % of Team’s Goals / Game

  • % of Team’s Assists / Game

I wanted to tease out goals and assists from raw adjusted points, and to include some team-level scoring metric. Oftentimes you’ll hear scouts say, “Oh, that player played on a bad team. If they played on a contender they would have put up more points.” By including how much a player’s individual scoring contributes to their team’s overall scoring, the model should capture those types of cases.
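As a rough illustration of the team-share features, the sketch below divides a player’s per-game scoring by the team’s per-game scoring. The function names and all numbers are hypothetical, and the exact definition (per-game rates, no era adjustment) is an assumption rather than the model’s confirmed formula.

```python
def per_game(total, games_played):
    """Rate stat; returns 0.0 when no games were played."""
    return total / games_played if games_played else 0.0

def team_share(player_total, player_gp, team_total, team_gp):
    """Player's per-game scoring as a fraction of the team's per-game scoring."""
    team_rate = per_game(team_total, team_gp)
    return per_game(player_total, player_gp) / team_rate if team_rate else 0.0

# Two hypothetical players who both score 0.5 goals per game (made-up numbers).
share_on_contender = team_share(30, 60, team_total=272, team_gp=68)  # team scores 4.0 G/GP
share_on_weak_team = team_share(30, 60, team_total=136, team_gp=68)  # team scores 2.0 G/GP
print(round(share_on_contender, 3), round(share_on_weak_team, 3))  # 0.125 0.25
```

Both players have identical raw scoring rates, but the one on the weaker team accounts for twice the share of his team’s offence, which is the “bad team” signal the feature is meant to pick up.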

Creating a cohort system is what made pGPS so understandable: it gives analysts an immediate set of names to match against the observed player. I wanted to see if I could create a cluster system that could address the limitations of the pGPS methodology as well as those of traditional scouting comparables. The latter often focus too much on nationality and league, and recency bias tends to creep in. “This guy plays like Patrick Kane” is a line I’ve heard more times than I can count over the years.

Since each player’s probability in pGPS depends on the matches the model creates, my goal was to create a “peer group” after modelling that player’s probability. This should create a bigger or more accurate sample of possible matches, extending beyond a player’s league and points per game. That said, the PPer clusters may not be as explainable or intuitive as the pGPS cohorts.

Another difficulty when creating a “peer group” is that height and weight are highly variable, especially when dealing with 17- and 18-year-olds. Public data (in my case from is updated constantly, and a player’s height and weight are typically updated throughout their careers. Comparing an 18-year-old’s combine biometrics to those of an established NHLer at 25 is not going to return accurate matches.

Since we know that players grow and add weight as they develop, I created a simple regression model to estimate what a player’s playing height and weight would be if they made it to the NHL. I did this by looking at NHL combine data from 2012-2016 to model a player’s height and weight when they were 22-23 years old. The results were interesting – let’s take a look at Elias Pettersson of the Vancouver Canucks and what his expected playing measurements would be:

Now that we have NHL probabilities and expected measurements, we can cluster our players to create the “Peer group” of the PPer model. Similarity scores are estimated by looking at players’ height and weight and NHL probability.
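One way to picture a similarity score over these three inputs is a standardized Euclidean distance, sketched below. This is an assumption about how similarity could be computed, not the model’s confirmed method. The prospect names, the measurements (in cm and kg), and the probabilities are all made up.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical rows: (expected height in cm, expected weight in kg, NHL probability).
players = {
    "Prospect A": (183.0, 84.0, 0.62),
    "Prospect B": (184.0, 85.0, 0.60),
    "Prospect C": (175.0, 77.0, 0.15),
    "Prospect D": (193.0, 98.0, 0.58),
    "Prospect E": (180.0, 82.0, 0.30),
}

def standardize(rows):
    """Z-score each column so no single unit (cm, kg, probability) dominates."""
    columns = list(zip(*rows.values()))
    means = [mean(col) for col in columns]
    spreads = [pstdev(col) or 1.0 for col in columns]  # guard against zero spread
    return {
        name: tuple((v - m) / s for v, m, s in zip(values, means, spreads))
        for name, values in rows.items()
    }

def nearest_peers(target, rows, n=2):
    """Return the n players closest to `target` in standardized feature space."""
    z = standardize(rows)
    ranked = sorted((dist(z[target], z[name]), name) for name in rows if name != target)
    return [name for _, name in ranked[:n]]

peers = nearest_peers("Prospect A", players)
print(peers)
```

Standardizing first matters because a 10 kg gap and a 0.10 probability gap are not comparable on their raw scales.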

Model Results

Using a simple logistic regression, the model appears to separate the data pretty well in the training set (2006-2013). The model’s log-loss for the test set (2014-2019) is much lower, because most players in that window have not yet had the chance to reach the 200-game threshold.
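For readers who want to see the mechanics, here is a minimal, hand-rolled logistic regression with a log-loss calculation. It is a sketch under assumptions: it uses only two of the six features (goals and assists per game), plain gradient descent instead of a library solver, and a made-up toy data set, so it is not the model’s actual implementation.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def predict(weights, bias, row):
    """Probability of reaching the 200-game benchmark for one player-season."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * log(p) + (1 - y) * log(1.0 - p)
    return total / len(y_true)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss, starting from zero weights."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(X, y):
            error = predict(weights, bias, row) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

# Toy data (made up): (goals per game, assists per game) and a 200-game label.
X = [(0.10, 0.15), (0.20, 0.25), (0.25, 0.40), (0.30, 0.30),
     (0.45, 0.50), (0.50, 0.70), (0.60, 0.55), (0.75, 0.90)]
y = [0, 0, 0, 1, 0, 1, 1, 1]

weights, bias = fit_logistic(X, y)
train_loss = log_loss(y, [predict(weights, bias, row) for row in X])
print(round(train_loss, 3))
```

A model that predicts 0.5 for everyone scores a log-loss of ln 2 (about 0.693), which is a handy baseline when reading the training and test numbers.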

For the clustering, I chose to use K-means as it is the most interpretable. Using randomly initialized centroids (one per cluster), I varied the total number of clusters and looked for the set of player peer groups that minimized the model’s inertia:


This allowed me to settle on 100 clusters for both Forwards and Defense positions.
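The inertia search can be sketched with a small, dependency-free K-means. Everything here is illustrative: the data is synthetic (three made-up blobs in a three-feature space), the helper names are hypothetical, and the range of cluster counts is tiny compared with the 100 clusters used in the model.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_inertia(points, k, n_iter=100, seed=0):
    """Run Lloyd's algorithm from random starting centroids; return the inertia."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged
        centroids = updated
    # Inertia: sum of squared distances from each point to its nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Synthetic, standardized-looking data: three blobs of 20 points each.
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0, 0.0), (5.0, 5.0, 0.0), (0.0, 5.0, 5.0)]
points = [tuple(data_rng.gauss(c, 0.5) for c in center)
          for center in blob_centers for _ in range(20)]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia can only be compared across cluster counts as a curve: adding clusters tends to keep lowering it, so the usual approach is to look for the point where the improvement flattens out.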

Using peer clusters and graphing probability against draft position makes a visual that would make Skittles lovers happy:

The composite probability is a weighted average of that player’s last 3 observed seasons (80/15/5), giving us one final probability number to evaluate. Interestingly, clustering player probability and draft position using the same methodology as the Peer method arrived at the same 100 clusters per position, and it looks way more aesthetically pleasing:
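As a side note on the composite probability, the 80/15/5 weighting can be sketched as follows. Two assumptions are baked in and are not confirmed by the model: the most recent season gets the 80% weight, and the weights are renormalized when a player has fewer than three observed seasons. The season probabilities are made up.

```python
SEASON_WEIGHTS = (0.80, 0.15, 0.05)  # assumed order: most recent season first

def composite_probability(season_probs):
    """Weighted average of up to three season probabilities, most recent first."""
    recent = list(season_probs)[:3]
    if not recent:
        raise ValueError("need at least one observed season")
    weights = SEASON_WEIGHTS[:len(recent)]
    # Renormalize so the weights still sum to 1 with fewer than three seasons.
    return sum(w * p for w, p in zip(weights, recent)) / sum(weights)

three_seasons = composite_probability([0.60, 0.40, 0.20])  # 0.48 + 0.06 + 0.01
two_seasons = composite_probability([0.60, 0.40])          # (0.48 + 0.06) / 0.95
print(round(three_seasons, 3), round(two_seasons, 3))
```

With this weighting, the latest season dominates the composite, so a breakout draft year moves the final number far more than anything that came before it.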

But we’re here for player peer groups, not pretty graphs, so here are the peer groups of the top 2 rated players in this upcoming draft: Jack Hughes and Kaapo Kakko. Remember that snarky comment earlier about scouts comparing players to Patrick Kane? Well…

As for Kakko:

Model shortcomings

There are three main pitfalls that I see with this model:

  1. A player’s probability is heavily influenced by the league they play in. Since the AJHL doesn’t produce nearly as many NHLers as the Liiga, players in the former will on average receive much lower probabilities than players in the Liiga, based on the sample.

  2. It’s possible there is overfitting, but out-of-sample validation is tough because some players drafted in 2013-2018 have not had the opportunity to amass 200 games.

  3. I’m treating each player season as an independent observation in the model, but it’s a weak assumption given that a player tomorrow will be very much influenced by the player they were today. Historical composite probabilities also include years after a player is drafted, and perhaps peer groups should include age.

Next Steps

It’s important to know what this model is for. In my opinion, it isn’t supposed to be used to separate players ranked 1-20, but perhaps players ranked 10-200. Building a team requires efficiency at the draft table in order to potentially unearth a diamond in the rough. This model should be able to identify a player who scores at a decent clip, who may be on a bad team, and cluster their peers, giving analysts and scouts a starting point before diving into a player’s finer details. If there are any questions, you can comment below or reach out to me on social media, as this model will be continually improved and iterated on as we progress into the summer.

As we move closer to the NHL Draft, the team will be publishing profiles of players we consider underrated and who could be good value on draft day. Stay tuned!