Implementing Boost Stat — Shot Quality
In January, I joined Boost Sport, a sports analytics startup focusing on NCAA Division I basketball, as a data science intern. As a company aiming to deliver customized insights for coaches' high-stakes decisions, we want to implement metrics that go beyond box scores to reflect the game; we call these Boost Stats. Shot Quality is an important part of these stats.
The Problem
Shot Quality measures how good a shot selection is: is the right player taking the right shot from the right position? With Boost's Play-By-Play data, we have every detail of a shot to determine its quality: Where is the shot taken? Is it an alley-oop, dunk, or step-back jumpshot? Is the shot taken in transition or as a second-chance opportunity? Using all this information, together with a fair estimate of the player's ability on this particular kind of shot, we can make a decent judgement about whether a shot is a good selection or not.
There are a bunch of factors that can affect the result of a shot. The question is, how do we assign weights to all these factors when calculating shot quality, which could be a very subjective metric? For example, some coaches prefer 3-pointers over mid-range shots, so they give higher shot quality ratings to 3-point attempts; on the other hand, some coaches might care more about whether it is a star or a role player taking the shot. As a data scientist, my take on the problem is: let a machine learning model decide!
Instead of manually assigning the weights, we decided to feed all the data into a tree-based binary classification model and let the model determine which factors have more influence on whether a shot is a make or a miss. With the model, we can predict the chance of every shot going in; multiplying that by the number of points the shot is worth, we get the expected points of the shot, which we use as the final output of the Shot Quality metric. This is arguably the most objective way to define the quality of a shot.
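As a minimal sketch of this idea (the model, the DataFrame, and column names like is_three and feature_cols are placeholders for illustration, not our production code), the expected-points calculation looks roughly like this:

```python
import pandas as pd

# Sketch only: `model` is any fitted binary classifier with predict_proba,
# `shots` is a DataFrame of shot features, `is_three` flags 3-point attempts.
def expected_points(model, shots: pd.DataFrame, feature_cols: list) -> pd.Series:
    p_make = model.predict_proba(shots[feature_cols])[:, 1]        # P(shot is made)
    shot_value = shots["is_three"].map({True: 3, False: 2})        # points the shot is worth
    return pd.Series(p_make * shot_value.to_numpy(), index=shots.index, name="expected_points")
```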
Data Preparation & Feature Engineering
We have ~3M rows and 30+ features to train our model. Before we start the modeling process, we want to extract more information from the raw data and transform it into new features to get better results from the model.
From our raw data, we have the x, y coordinates of shots. This is very useful information, but we can get more out of it. For instance, using the x, y coordinates of the two rims on the court, we can compute how far each shot is from the rim it was attempted at, giving us a new predictor, shot distance, that is more interpretable for the model. Furthermore, we can also assign a shot zone to each shot by calculating its distance to each of the 15 Boost-defined shot zones, as sketched below.
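Here is a rough sketch of these two features; the rim coordinates, zone centers, and column names below are illustrative stand-ins, not the actual Boost court definitions:

```python
import numpy as np
import pandas as pd

# Hypothetical rim locations and zone centers (in feet); the real values come from Boost's court spec.
RIMS = {"left": (5.25, 25.0), "right": (88.75, 25.0)}
ZONE_CENTERS = {"left_corner_3": (3.0, 2.0), "top_of_key": (25.0, 25.0)}  # ...15 zones in total

def add_shot_features(shots: pd.DataFrame) -> pd.DataFrame:
    shots = shots.copy()
    # Distance from the shot's (x, y) location to the rim it was attempted at.
    rim_x = shots["attacking_rim"].map(lambda r: RIMS[r][0])
    rim_y = shots["attacking_rim"].map(lambda r: RIMS[r][1])
    shots["shot_distance"] = np.hypot(shots["x"] - rim_x, shots["y"] - rim_y)

    # Assign each shot to the nearest predefined zone center.
    zone_names = list(ZONE_CENTERS)
    centers = np.array([ZONE_CENTERS[z] for z in zone_names])             # shape (n_zones, 2)
    dists = np.linalg.norm(shots[["x", "y"]].to_numpy()[:, None, :] - centers[None, :, :], axis=2)
    shots["shot_zone"] = [zone_names[i] for i in dists.argmin(axis=1)]
    return shots
```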

Besides the attributes of the shot itself, we also need information about the player to determine the quality of a shot selection. For example, even if it's a wide-open shot opportunity in the corner, if the shooter is a player who never shoots from 3 and is expanding his range, we would not want to assign a high rating to this shot.
A straightforward measurement of a player's shooting ability is Field Goal Percentage (FG%): the percentage of his shot attempts that the player makes. However, it can be a fairly misleading estimate when a player has not taken many shots. Let's say player A shoots only 5 times from 3-point range and makes 2 of them (40 FG%); we can't conclude that he is a better shooter than player B, who is 38/100 (38 FG%) from the same range. To summarize, when we don't have enough samples of shots from a player, whether his FG% is high or low, we don't have enough information or confidence to conclude whether he's a good shooter or not.
On the other hand, when a player has taken many shots, we tend to believe that he is a good shooter. Why? Because the coach allows him to do so! Good players get more playing time and more shot opportunities on the court.
To account for the fact that as Field Goal Attempts increase, average FG% also increases (see the black dots in the following graph), we can apply beta-binomial regression to estimate a player's true shooting ability.

The methodology here is to assign a prior based on how many shots a player has taken, and then adjust it with his raw FG%. For example, take a player who shoots 45/100 (45%) from 3-point range, and suppose the beta-binomial regression gives a prior of 360/1000 (36%) for players with 100 attempts. We then combine the prior with his raw makes and attempts to get a final estimate: (45+360)/(100+1000) = 36.82%. Notice that the player's adjusted FG% is boosted above the prior we assign to him, since he shoots a higher percentage than we expect him to.
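A minimal sketch of the adjustment step, assuming the prior parameters (here 360 and 640, i.e. a 36% prior with strength 1,000) have already been produced by the beta-binomial regression:

```python
# Empirical-Bayes shrinkage of a player's raw FG% toward a prior.
# alpha0/beta0 are assumed to come from the beta-binomial regression on shot attempts (not shown).
def adjusted_fg_pct(makes: int, attempts: int, alpha0: float, beta0: float) -> float:
    return (makes + alpha0) / (attempts + alpha0 + beta0)

# Worked example from the text: 45/100 raw with a 360/1000 prior.
print(adjusted_fg_pct(45, 100, alpha0=360, beta0=640))  # ~0.3682
```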
Modeling
Now we have a standard binary classification problem: is a shot a make or a miss? A variety of models are available, and I chose a random forest model for production, as it gives the best result on the validation set.
I didn't use the standard 80/20 train/validation split, as the ultimate goal of the model is to predict the chance of a shot going in during the current season. Therefore, I used all previous seasons as my training set and the current season as the validation set.
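Sketched below with illustrative column names (season, is_make, and the feature_cols list are assumptions carried over from the earlier sketches), this is roughly what the season-based split and model fit look like:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

CURRENT_SEASON = 2021  # assumed label for the current season

def fit_and_validate(shots: pd.DataFrame, feature_cols: list):
    train = shots[shots["season"] < CURRENT_SEASON]    # all previous seasons
    valid = shots[shots["season"] == CURRENT_SEASON]   # current season held out

    model = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    model.fit(train[feature_cols], train["is_make"])

    # Evaluate the predicted make probabilities on the held-out season.
    p_valid = model.predict_proba(valid[feature_cols])[:, 1]
    print("validation log loss:", log_loss(valid["is_make"], p_valid))
    return model
```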
One notable issue here is data leakage, as the raw FG% and adjusted FG% we calculated are derived directly from the target variable. This can be viewed as an application of target encoding. Thus, in the training set, I performed stratified K-Fold splitting on player_id and used only out-of-fold information in each fold.
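A sketch of the out-of-fold encoding on the training set follows; it reuses the assumed player_id and is_make columns and glosses over players with fewer shots than the number of folds:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def oof_raw_fg_pct(train: pd.DataFrame, n_splits: int = 5) -> pd.Series:
    """Per-shot raw FG% for the shooter, computed only from shots in the other folds."""
    oof = pd.Series(np.nan, index=train.index, name="raw_fg_pct")
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, enc_idx in skf.split(train, train["player_id"]):
        fg_by_player = train.iloc[fit_idx].groupby("player_id")["is_make"].mean()
        oof.iloc[enc_idx] = train.iloc[enc_idx]["player_id"].map(fg_by_player).to_numpy()
    return oof.fillna(train["is_make"].mean())  # fall back to the global mean if a player is unseen
```

The adjusted FG% is built the same way, feeding the out-of-fold makes and attempts into the shrinkage step shown earlier.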
The validation set is a bit trickier to handle. Normally we could simply use the mean of the target from the training set. However, since I chose the current season as the validation set, there will be shots taken by new players who have not played in previous seasons. It is safe to assign these players the average of other players at the beginning of the season, and it complies with the real-world setting: when a freshman first shows up, we don't know much about whether he's good or not. But as the season goes on and we gain more knowledge of these players, it makes less sense to keep using other players' average to estimate them. So I split the validation set into 5 subsets ordered by time, and calculated raw FG% and adjusted FG% for each subset using the training set plus all previous validation subsets.
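And a sketch of that expanding-window encoding for the validation season, again with illustrative column names (game_date is assumed to order the shots in time):

```python
import numpy as np
import pandas as pd

def expanding_fg_pct(train: pd.DataFrame, valid: pd.DataFrame, n_chunks: int = 5) -> pd.Series:
    """Encode each time-ordered validation chunk using the training set plus earlier chunks only."""
    valid = valid.sort_values("game_date")
    out = pd.Series(np.nan, index=valid.index, name="raw_fg_pct")
    history = train[["player_id", "is_make"]]
    for chunk in np.array_split(np.arange(len(valid)), n_chunks):
        rows = valid.iloc[chunk]
        fg_by_player = history.groupby("player_id")["is_make"].mean()
        out.loc[rows.index] = rows["player_id"].map(fg_by_player).to_numpy()
        history = pd.concat([history, rows[["player_id", "is_make"]]])  # grow the known history
    return out.fillna(history["is_make"].mean())  # new players start at the overall average
```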
Results & Applications
After we finish model tuning, we can output the expected points of each shot for production and examine high-level results.

This is a shot heatmap of all shots taken in the 2021 NCAA Division I basketball season. Shots around the rim and 3-pointers are the most efficient shots, according to our Shot Quality model.
We can also perform analysis at the team level, so that coaches can get a better idea of which parts of the season, which periods of a game, and which players tend to produce good or bad quality shots.

For example, this graph shows GA Tech's shot quality consistency across the season. The Yellow Jackets had a slow start back in December, but they bounced back to above the conference level and stayed consistent for the remainder of the season.

This graph shows GA Tech's shot quality consistency across different periods of games. We can tell that they are a second-half team: their expected points per shot jump to 1.18 in the second half from a first-half average of 1.13.
Thanks for reading!
Ricky Zhang
M.S. Data Science @ USFCA. Graduating in August.
- If you enjoyed this and want to talk, let’s connect on LinkedIn!