---
title: "NBA Free Agency Contract Prediction 2023"
author: "Sumitro Datta"
format: pdf
editor: visual
date: "`r Sys.Date()`"
execute:
  cache: true
  echo: true
  warning: false
  message: false
---
# Introduction
This is going to be my fourth year attempting to predict free agent contracts. What happened last year?
- **James Harden & Bradley Beal** both opted out of their player options & re-signed with their teams. Harden signed a [2-year, \$68.6M](https://www.nba.com/news/harden-agrees-to-2-year-68-6m-deal-with-76ers) contract including a player option with the Sixers. Beal signed a [five-year, \$251M](https://twitter.com/wojespn/status/1542629367779565569) maximum contract (including a no-trade clause) to stay with the Wizards.
- Restricted free agents **Deandre Ayton & Anfernee Simons** cashed in, signing long-term deals with their current teams. Ayton signed a [4-year, \$133M](https://twitter.com/ShamsCharania/status/1547737051751059457) offer sheet with the Pacers that was subsequently matched by the Suns. Simons agreed to a [4-year, \$100M](https://www.nba.com/news/anfernee-simons-2022-nba-free-agency) contract with the Blazers.
- Unrestricted free agents **Zach LaVine & Jalen Brunson** were the only other players to secure deals totalling more than \$100M. LaVine extended his tenure with the Bulls by agreeing to a [5-year, \$215M](https://www.nba.com/news/zach-lavine-2022-nba-free-agency) maximum contract. Jalen Brunson was the only player in the top 6 and one of 2 in the top 13 to switch teams, moving from Dallas to New York on a [4-year, \$104M](https://www.nba.com/news/jalen-brunson-2022-nba-free-agency) contract.
Top players holding player options include **Khris Middleton, James Harden & Kristaps Porzingis.** Middleton was a key part of the Bucks championship in 2021, but struggled with health this year: offseason wrist surgery caused him to miss the first 20 games of the season, and knee soreness cost him another 18 games in December & January. Harden signed a short-term contract at less than his maximum amount to give the Sixers some cap flexibility; this allowed the team to sign Harden's former teammate PJ Tucker. Harden was the sidekick to eventual MVP Joel Embiid, but the Sixers bowed out in the second round of the playoffs to their hated rivals, the Boston Celtics. Porzingis had his first full season in Washington after being traded there in 2022 from the Mavericks, where the Luka-Zinger pairing had seemingly run its course. Porzingis had his healthiest season in five years (since his All-Star year in 2018), setting career highs in points per game and shooting efficiency.
In terms of restricted free agents, the top players are **Cameron Johnson, PJ Washington & Austin Reaves**. In his sophomore year, Johnson was the Suns' sixth man on their run to the Finals in 2021. He has improved his efficiency as his field goal attempts have increased. Johnson was part of the trade bringing Kevin Durant to the desert, landing in Brooklyn alongside Mikal Bridges. Washington has spent the first four years of his NBA career in Charlotte as a steady big on a team usually dominated by its guards (Terry Rozier has been on the team for the same time as Washington, Devonte' Graham led the team in scoring in Washington's first year, and LaMelo Ball was drafted third overall in 2020). Reaves went undrafted in 2021 and signed a two-way deal with the Lakers. He was converted to a regular deal a month before the season started, and showed some promise in a season in which the Lakers missed the playoffs due to stars LeBron James & Anthony Davis missing large chunks of the season. Reaves leveled up in 2023, averaging the fifth-most minutes per game & points per game on the team.
On the unrestricted free agent side, we've got **Kyrie Irving, Nikola Vucevic & Fred VanVleet** topping the lists. Kyrie had a \$36.5M option last year that he picked up, [surprising even his own team](https://nba.nbcsports.com/2022/06/28/report-nets-learned-of-kyrie-irving-opting-in-with-his-public-statement/). He was named an All-Star for the eighth time in his career, but after extension talks stalled, Kyrie [requested a trade](https://www.nba.com/news/kyrie-irving-requests-trade) in early February. His wish was granted two days later when he was sent to the Dallas Mavericks. Vucevic was acquired by the Bulls from the Magic at the 2021 trade deadline to give Zach LaVine the best big man he had ever played with. After signing Lonzo Ball and DeMar DeRozan in the 2021 offseason, the Bulls looked to make some noise. Unfortunately, the 2022 Bulls were the 6th seed and lost in the first round to the eventual champion Bucks, while the 2023 Bulls lost in the play-in. With Ball suffering a seemingly career-ending injury, perhaps the Bulls let Vucevic walk rather than lock themselves into, at best, a lower-end playoff team? VanVleet was a free agent back in 2020 on the [first iteration of this project](https://github.com/sumitrodatta/contract-prediction-2020/blob/master/contract-predict-markdown.pdf). Predictions came in at 4 years and \$93-97 million, and VanVleet ended up re-signing with the Raptors on a 4-year, \$85M contract with a player option on the final year. VanVleet was an All-Star in 2022, but also never shot better than 40.3% from the field in the past three years despite taking the second-most shots on the team.
What I wanted to do was predict what contracts this year's free agent class might get based on previous offseasons. Stars generally get star-type money on account of there being a maximum contract; in the tiers below, though, contracts of comparable players usually come up when discussing contract value.
# Methods/Analysis
## Loading Packages
Let's start by loading required packages.
```{r load_pkgs, results="hide"}
packages=c("tidyverse",
           "tidymodels",
           "janitor", # cleaning variable names
           "glmnet", # multinomial regression
           "ranger", # random forest algorithm wrapper
           "kknn", # nearest neighbors algorithm wrapper
           "rpart", # decision tree algorithm wrapper
           "rpart.plot", # plot decision tree outputs
           "kernlab", # svm kernel wrapper
           "vip", # variable importance
           "zoo", # rolling/window operations
           "matrixStats",
           "RColorBrewer",
           "readxl", # read excel files
           "ggdark", # dark background plots
           "gt") # table creator (replaces kableExtra & formattable)
for (pkg in packages){
  if(!require(pkg,character.only = TRUE)){
    install.packages(pkg,repos = "http://cran.us.r-project.org")
  }
}
```
## Importing the Data
For the statistical data, I've scraped total and advanced stats from Basketball-Reference and stored them in .csv files. This was part of a larger project to scrape complete statistics for teams, players and awards (the Kaggle dataset resides [here](https://www.kaggle.com/sumitrodatta/nba-aba-baa-stats)). To my knowledge, my dataset is unique in that it includes BAA and ABA stats, though those aren't of use here.
The advanced stats I kept were cumulative (offensive win shares, defensive win shares and value over replacement player). For players who played on multiple teams in one season, I kept their total stats and discarded the team-specific season portions. There was an initial desire to use totals to bake in availability/body fragility, but the shortened seasons would cause the model to declare all players fragile and underestimate their contracts.
In the first iteration of this project, we scaled games played and games started to a normal distribution due to fluctuations in games played between seasons caused by the COVID-19 pandemic. We will convert the games started to a percentage of games played and we will change the games played to a percentage of maximum playable games. This maximum will differ for players who played for multiple teams in one season.
```{r load_stats}
#specify columns because otherwise birth year is read as logical
cols_for_stats=cols(
  .default = col_double(),
  player = col_character(),
  pos = col_character(),
  lg = col_character(),
  tm = col_character()
)
advanced<-read_csv("Data/Advanced.csv",col_types = cols_for_stats) %>%
select(seas_id:mp,ows:ws,vorp) %>%
#players with 0 mp (rounded down) would have div by zero error
mutate(ws_per_48=if_else(mp==0,0,ws/mp*48),.before="vorp")
totals<-read_csv("Data/Player Totals.csv",col_types = cols_for_stats)
#max games per season for players on multiple teams
max_games_tots=totals %>% filter(tm=="TOT") %>% group_by(season,lg,tm) %>%
summarize(max_games_tot=max(g,na.rm = TRUE)) %>% ungroup()
#max games per season for players on single team
max_games=totals %>% filter(tm !="TOT") %>% group_by(season,lg) %>%
summarize(max_games_non_tot=max(g,na.rm = TRUE)) %>% ungroup()
#coalesce above two into one column in totals df
totals_enhanced=left_join(totals,max_games_tots) %>% left_join(.,max_games) %>%
mutate(max_games_playable=coalesce(max_games_tot,max_games_non_tot)) %>%
select(-c(max_games_non_tot,max_games_tot))
advanced_and_totals<-left_join(totals_enhanced,advanced) %>%
#if player played for multiple teams in season, only take total row
mutate(tm=ifelse(tm=="TOT","1TOT",tm)) %>%
group_by(player_id,season) %>% arrange(tm) %>% slice(1) %>%
mutate(tm=ifelse(tm=="1TOT","TOT",tm)) %>%
arrange(season,player) %>%
mutate(g_percent=g/max_games_playable,gs_percent=gs/g,.before=g) %>%
select(-c(gs,max_games_playable)) %>%
#filter since 1997 to match w/play-by-play + faster pre-processing
filter(season > 1996) %>% ungroup()
```
A comment was raised on an earlier iteration regarding positional scarcity. Teams will overpay for the potential piece that puts them "over the hump", whether that be playoff contention or the loftier goal of championship contention. Teams might also panic to acquire a player deemed to be the last one in a talent tier above the remaining free agents at the same position.
The play-by-play data (available since the 1997 season) tracks the percentage of minutes a player has played at each traditional position (point guard, shooting guard, small forward, power forward & center). I converted the percentages to raw minutes played at each position, and summed across all teams a player played for in the same season. I felt this method was more accurate than using the "totals" row.
```{r load_pbp}
play_by_play<-read_csv("Data/Player Play By Play.csv") %>%
filter(tm!="TOT") %>%
select(seas_id:player,mp:c_percent)
#replace NA's with zeroes
play_by_play[is.na(play_by_play)] <- 0
pbp_pos_mins=play_by_play %>%
#convert percents to minutes played at position
mutate(across(pg_percent:c_percent,~./100*mp)) %>%
rename_with(.fn=~gsub(x = ., pattern = "_percent", replacement = "_mp"),
.cols=pg_percent:c_percent) %>%
#sum season minutes across different teams
group_by(season,player_id) %>%
mutate(across(mp:c_mp,sum,.names="{col}_summed")) %>%
slice(1) %>% ungroup() %>% select(-c(mp:c_mp))
```
Another concern I noticed is that top-tier players received depressed contract predictions. Due to the smaller number of players on the court and smaller rosters in general, an individual player's importance is heightened compared to other team sports. As such, elite players get paid the most money. To identify elite players, I brought in end-of-season All-Defense and All-NBA Team voting.
There are two All-Defense teams made up of 5 players each, honoring the top defensive players in the league. They have been voted on by media since 2014. Prior to that, NBA head coaches voted on the All-Defensive team recipients, with the caveat that they could not vote for players on their own team. Players get 2 points for a first-team vote, and 1 point for a second-team vote. There are three All-NBA teams made of 5 players each, honoring the top overall players in the league. The All-NBA teams are voted on by media, and were expanded to the current 3-team setup in 1989. Players get 5 points for a first-team vote, 3 points for a second-team vote and 1 point for a third-team vote. To normalize differences in total points received, I calculated a player's vote share by dividing their points received by the maximum points available (essentially if every possible voter gave that player a first-team vote).
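As a quick illustration of the vote-share formula, here's a hypothetical All-Defense tally (the ballot and vote counts below are made up): a first-team vote is worth 2 points, a second-team vote is worth 1, and the maximum is every voter casting a first-team vote.

```{r vote_share_example}
# Hypothetical: 100 ballots; player gets 60 first-team and 30 second-team votes
ballots=100
pts_won=2*60+1*30
pts_max=2*ballots # every voter awarding a first-team vote
share=pts_won/pts_max
share # 0.75
```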
The All-NBA voting is also pulled from Basketball-Reference from each season's awards voting section (here's [2023](https://www.basketball-reference.com/awards/awards_2023.html#all_leading_all_nba) for an example). Basketball-Reference very recently added All-Defense voting, so I compiled it myself. Since 2014, the NBA has published all voter ballots and I've scraped those PDF files for a separate [project](https://www.kaggle.com/datasets/sumitrodatta/nba-endofseason-voting-ballots-20142020). To get earlier results, I relied on [Patricia Bender's site](https://www.eskimo.com/~pbender/index.html) as well as newspapers.com user [iknowball](https://www.newspapers.com/profile/iknowball?tab=clippings).
```{r load_all_d_and_all_nba_shares}
all_defense_voting_since_2014=read_csv("Data/all-def.csv") %>%
pivot_longer(cols=first_fwd:second_g_2,names_to="points",values_to="player") %>%
#2023 awards have no comma between last name & first name
mutate(player=if_else(year==2023,str_replace(player," ",", "),player)) %>%
mutate(player=gsub("\\(.*","",player)) %>%
mutate(player=gsub("--.*","",player)) %>% mutate(player=str_trim(player)) %>%
separate(player,into=c("last","first"),sep=", ",convert = TRUE) %>%
unite("player",c(first,last),sep=" ",na.rm=TRUE) %>%
mutate(points_given=case_when(
(str_detect(points,"first"))~2,
(str_detect(points,"second"))~1,
)) %>%
mutate(player=case_when(
#3 players in all-def 2015 have one dash rather than two
str_detect(player," - IND")~"George Hill",
str_detect(player," - SA")~"Danny Green",
str_detect(player," - Mil")~"Giannis Antetokounmpo",
str_detect(player,"PJ Tucker")~"P.J. Tucker",
str_detect(player,"TJ McConnell")~"T.J. McConnell",
str_detect(player,"Ginobili")~"Manu Ginóbili",
str_detect(player,"Porzingis")~"Kristaps Porziņģis",
str_detect(player,"Jokic")~"Nikola Jokić",
str_detect(player,"Doncic")~"Luka Dončić",
str_detect(player,"Schroder")~"Dennis Schröder",
str_detect(player,"Robert Williams III")~"Robert Williams",
str_detect(player,"Michael Jr. Porter")~"Michael Porter Jr.",
str_detect(player,"Jr. Jaren Jackson")~"Jaren Jackson Jr.",
str_detect(player,"O.G. Anunoby")~"OG Anunoby",
TRUE~player)) %>%
#replace non-ascii dashes
mutate(player=str_replace(player,"\u2010","-")) %>%
mutate(vote_position=word(points,start=2,sep="_")) %>%
select(-points) %>% rename(season=year)
all_d_vote_shares_since_2014=all_defense_voting_since_2014 %>%
#points maximum is number of ballots * 2 points for first-team vote
#number of ballots is number of choices divided by 10 (5 1st team, 5 2nd team)
group_by(season) %>% mutate(pts_max=n()/10*2) %>%
group_by(season,player,pts_max) %>%
summarize(pts_won=sum(points_given),
x1st_tm=sum(points_given==2),
x2nd_tm=sum(points_given==1)) %>%
ungroup() %>% mutate(share=pts_won/pts_max)
all_d_vote_shares_until_2013=read_excel("Data/All-Defense Voting < 2013.xlsx")
all_d_vote_shares_since_1991=bind_rows(all_d_vote_shares_since_2014,
all_d_vote_shares_until_2013) %>%
select(season,player,all_defense_share=share)
all_nba_voting_shares=read_csv("Data/End of Season Teams (Voting).csv") %>%
filter(type=="All-NBA") %>%
select(season,player,seas_id,player_id,all_nba_share=share)
```
```{r clean_environ,include=FALSE}
rm(packages,pkg,cols_for_stats,
advanced,totals,max_games_tots,max_games,totals_enhanced,play_by_play,
all_defense_voting_since_2014,all_d_vote_shares_since_2014,
all_d_vote_shares_until_2013)
```
I decided to completely revamp the free agency training set. Rather than scrape the Basketball-Reference [free agents tracker](https://www.basketball-reference.com/friv/free_agents.fcgi), I decided to scrape the [Pro Sports Transaction Archive](https://www.prosportstransactions.com). The benefits to this decision were twofold:
- the Basketball-Reference free agents tracker only goes back to the 2016 offseason. The Pro Sports Transaction Archive allows me to add free agency periods before then.
- Basketball-Reference would mostly include players that were on team rosters on the final day of the regular season, excluding players who were waived during the season.
I started the new training set from the 2012 offseason, which was the first offseason of the new collective bargaining agreement. I kept the logic fairly similar to the previous training set. Players who signed from overseas and players who retired before the start of the next season were not included as free agents. The former would have no statistical data to pull, and the latter would artificially inflate the number of players who didn't sign a contract. Players whose salary & contract years were set to zero either went overseas, had explicitly non-guaranteed first years in their contracts (training camp deals, two-ways, ten-days, Exhibit 10s), or didn't end up signing a contract before the season started. Option years & partially guaranteed years were included in the calculation of contract years; I looked at it as both player & team intending to see out the contract. The majority of year 1 salaries were gathered using Spotrac. Other minor sources include Basketball-Insiders, Basketball-Reference and Patricia Bender.
A couple of new wrinkles while expanding the training set:
- some players would sign multiple contracts, either due to failing a physical and voiding a contract, or converting from a non-guaranteed contract to a guaranteed one. I decided to take the first contract signed, as future contracts would have been based on additional information.
- how should restricted free agents whose qualifying offers were rescinded or never extended be handled? I could take the status as of contract signing (which would decrease the amount of RFA data to train on) or the status as of the start of free agency (which could be misleading, as the player might not have gotten the same contract as an RFA). I decided on the former.
```{r load_train}
past_free_agents<-read_csv("Data/Past Free Agents.csv")
```
The next file I used was salary cap history, scraped from RealGM. To somewhat normalize comparisons across years, I converted the first year salary to a percentage of the salary cap.
```{r load_cap_hist}
#subtract one from year to match up with offseason in which contract was signed
salary_cap_hist<-read_csv("Data/Salary Cap History.csv") %>% mutate(season=season-1)
current_salary_cap=salary_cap_hist %>% filter(season==2023) %>% pull()
#create variable of first year salary as percentage of cap
#easier to compare across years
past_free_agents<-past_free_agents %>% select(-c(terms,Source)) %>%
left_join(.,salary_cap_hist) %>%
mutate(first_year_percent_of_cap=yr_1_salary/cap) %>%
select(-c(yr_1_salary,cap))
```
Quick question, who do you think has the highest recorded salary cap percentage in the dataset? If you chose one of LeBron James, Kevin Durant or Stephen Curry, you'd probably be surprised to find out that you're incorrect! The highest percentage belongs to Carmelo Anthony, who re-signed with the New York Knicks in 2014 for a first-year salary cap percentage of 35.61%. Due to an exception that ["the maximum salary in the first season of a contract is never less than 105% of the salary in the last year of the player's previous contract"](http://www.cbafaq.com/salarycap.htm#Q23), Carmelo was able to surpass the 35% league-wide maximum salary.
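A rough sketch of that arithmetic (the salary and cap figures below are approximate, entered by hand rather than pulled from the loaded data):

```{r melo_105_example}
prev_salary=21388953   # approximate 2013-14 salary
cap_2014=63065000      # approximate 2014-15 salary cap
league_max=0.35*cap_2014           # nominal 35% maximum salary
grandfathered_max=1.05*prev_salary # 105% of prior salary overrides it
round(grandfathered_max/cap_2014,4) # ~0.3561, above the 35% line
```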
The last file loaded is our evaluation set: the 2023 free agent class, retrieved from Spotrac. I had to edit this dataset to match the Basketball-Reference names (mainly adding diacritics to European names). In addition, I filtered out players with options. Players who decline a player option become unrestricted free agents, as do players with more than 3 years of experience whose team option is declined. Players with 3 or fewer years of experience whose team option is declined become restricted free agents. I'll use this fact to see which players & teams might decline their option.
```{r load_eval}
current_fa<-read_csv("Data/Free Agents 2023.csv")
#separate out options to compare what players options get if declined
current_fa_options<-current_fa %>% filter(str_detect(type,"Player|Club")) %>%
select(-c(experience,contract_yrs)) %>%
rename(option_type=type,option_amt=first_year_percent_of_cap)
#make player options all declined (UFA's)
#make club options ufa or rfa depending on exp
current_fa<-current_fa %>%
mutate(type=case_when((type=="Player"|(type=="Club" & experience >= 4))~"UFA",
(type=="Club" & experience < 4)~"RFA",
TRUE~type)) %>%
group_by(player) %>% select(-experience) %>% slice(1) %>% ungroup() %>%
mutate(first_year_percent_of_cap=NA)
```
In the GitHub repository where this project is located, a file called `free agents.r` has more details on exactly how I scraped the train set, evaluation set and the salary cap history.
## Retrospective on Last Year's Results
Before getting into pre-processing, we'll take a look at last year's results and see how the models performed.
```{r last_years_results_load, echo=FALSE}
non_options=read_csv("https://raw.githubusercontent.com/sumitrodatta/contract-prediction-2022/master/Non-Option%20Contracts.csv")
options=read_csv("https://raw.githubusercontent.com/sumitrodatta/contract-prediction-2022/master/Options.csv")
combined_predictions=bind_rows(non_options,options) %>%
clean_names() %>%
#take out totals
select(-starts_with("total")) %>%
select(-c(age,type)) %>%
mutate(across(ends_with("cap_percent"),~parse_number(.)/100)) %>%
mutate(x2021_option=parse_number(x2021_option)) %>%
#correct names
mutate(player=case_when(str_detect(player,"Otto")~"Otto Porter Jr.",
str_detect(player,'Clax')~"Nic Claxton",
str_detect(player,"Danuel")~"Danuel House Jr.",
str_detect(player,"Lonnie")~"Lonnie Walker IV",
str_detect(player,"Woodard")~"Robert Woodard II",
TRUE~player)) %>%
#join predictions with actual
left_join(.,past_free_agents %>% filter(season==2022)) %>%
select(-c(season,option_type,type)) %>%
#filter out players who either picked up player option or had team option picked up
filter(is.na(x2021_option)|!is.na(contract_yrs)) %>%
select(-x2021_option) %>%
replace_na(list(contract_yrs=0,first_year_percent_of_cap=0)) %>%
mutate(contract_yrs=factor(contract_yrs,levels=0:5),
yrs_y1s2=factor(yrs_y1s2,levels=0:5),
yrs_s1y2=factor(yrs_s1y2,levels=0:5)) %>%
mutate(across(contains("cap"),~round(.,digits=4)))
rm(non_options,options)
```
```{r accuracy,echo=FALSE}
y1s2_yr_acc=round(accuracy(combined_predictions,truth=contract_yrs,estimate=yrs_y1s2) %>% pull(.estimate)*100,2)
s1y2_yr_acc=round(accuracy(combined_predictions,truth=contract_yrs,estimate=yrs_s1y2) %>% pull(.estimate)*100,2)
```
The years accuracy of the years-first model was `r y1s2_yr_acc`%, while that of the salary-first model was `r s1y2_yr_acc`%. The 2020 models were at 49-51% accuracy, and the 2021 models improved to 58-60%, so accuracy has declined after that initial improvement. Here are confusion matrices showing how each model handled the prediction of contract years.
```{r years_confusion_matrix,echo=FALSE}
#| layout-ncol: 2
y1_yr_heatmap=combined_predictions %>%
conf_mat(data=.,truth=contract_yrs,estimate=yrs_y1s2) %>%
autoplot(type="heatmap") +
ggtitle("Actual Contract Years vs Predicted Contract Years",
subtitle="Years-First Model")
ggsave(filename = "Images/Years Predict v Actual, Y1S2 Model.png",plot=y1_yr_heatmap)
s1_yr_heatmap=combined_predictions %>%
conf_mat(data=.,truth=contract_yrs,estimate=yrs_s1y2) %>%
autoplot(type="heatmap") +
ggtitle("Actual Contract Years vs Predicted Contract Years",
subtitle="Salary-First Model")
ggsave(filename = "Images/Years Predict v Actual, S1Y2 Model.png",plot=s1_yr_heatmap)
y1_yr_heatmap
s1_yr_heatmap
```
The incorrect predictions were pessimistic: they skewed toward predicting fewer years than the player actually received, as evidenced by the sum of the upper triangle exceeding that of the lower triangle (models forecasting no contract for players who received a one-year deal, one year for players who got two years, and so on).
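The triangle comparison can be sketched on a toy confusion matrix (hypothetical counts; predictions on the rows, actual contract years on the columns):

```{r triangle_sketch}
# Cells above the diagonal are under-predictions (predicted < actual),
# cells below are over-predictions
cm=matrix(c(40,12,4,
            5,20,9,
            1,3,10),nrow=3,byrow=TRUE,
          dimnames=list(predicted=0:2,actual=0:2))
sum(cm[upper.tri(cm)]) # under-predictions: 25
sum(cm[lower.tri(cm)]) # over-predictions: 9
```

A pessimistic model shows the pattern above, with the upper-triangle sum dominating.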
Here are the worst misses for both models.
```{r last_yr_extreme_yr_misses,echo=FALSE}
#| layout-ncol: 2
yrs_y1s2_misses=combined_predictions %>%
  arrange(desc(contract_yrs),desc(yrs_y1s2)) %>%
  mutate(abs_error_y1s2_yrs=abs(as.numeric(contract_yrs)-as.numeric(yrs_y1s2))) %>%
  slice_max(abs_error_y1s2_yrs,n=5) %>% select(player,yrs_y1s2,contract_yrs)
yrs_s1y2_misses=combined_predictions %>%
  arrange(desc(contract_yrs),desc(yrs_s1y2)) %>%
  mutate(abs_error_s1y2_yrs=abs(as.numeric(contract_yrs)-as.numeric(yrs_s1y2))) %>%
  slice_max(abs_error_s1y2_yrs,n=5) %>% select(player,yrs_s1y2,contract_yrs)
gt(yrs_y1s2_misses)
gt(yrs_s1y2_misses)
```
On June 29, 2022 (one day before NBA free agency opened), Bridges was arrested in Los Angeles for [felony domestic violence](https://www.latimes.com/california/story/2022-06-29/nba-miles-bridges-reportedly-arrested-in-los-angeles-on-suspicion-of-domestic-violence) and was released on \$130,000 bond. Beal was the main reason I decided to bring in end-of-season team voting shares, because I immediately knew he would be getting more than 2 years in a contract offer. Čančar and Hauser re-signed with the Nuggets and Celtics respectively.
```{r last_yr_results,echo=FALSE}
y1_rmse=rmse(combined_predictions,
truth=first_year_percent_of_cap,
estimate=y1s2_cap_percent) %>%
pull(.estimate)
s1_rmse=rmse(combined_predictions,
truth=first_year_percent_of_cap,
estimate=s1y2_cap_percent) %>%
pull(.estimate)
```
Let's shift our focus to the salary predictions. First, the root mean squared error (RMSE) of the years-first model was `r y1_rmse`, while the salary-first model had an RMSE of `r s1_rmse`. However, this includes Miles Bridges not receiving a contract when he was predicted for 23.55% of the salary cap under Y1S2 and 21.92% of the salary cap under S1Y2.
```{r last_yr_results_wo_miles_bridges,echo=FALSE}
y1_rmse_revised=rmse(combined_predictions %>% filter(player != "Miles Bridges"),
truth=first_year_percent_of_cap,
estimate=y1s2_cap_percent) %>%
pull(.estimate)
s1_rmse_revised=rmse(combined_predictions %>% filter(player != "Miles Bridges"),
truth=first_year_percent_of_cap,
estimate=s1y2_cap_percent) %>%
pull(.estimate)
```
Removing Bridges, the new years-first RMSE is `r y1_rmse_revised`, while the new salary-first RMSE is `r s1_rmse_revised`. Last year's Y1 RMSE was 0.0245718 and S1 RMSE was 0.0248204. Even excluding the Bridges outlier, this year's RMSEs are higher.
As we did with the years models, let's look at the most extreme salary misses.
```{r last_yr_extreme_sal_misses,echo=FALSE}
sal_y1s2_misses=combined_predictions %>%
filter(player != "Miles Bridges") %>%
mutate(abs_error_y1s2_salary=abs(first_year_percent_of_cap - y1s2_cap_percent)) %>%
slice_max(abs_error_y1s2_salary,n=10) %>%
select(player,y1s2_cap_percent,first_year_percent_of_cap)
sal_s1y2_misses=combined_predictions %>%
filter(player != "Miles Bridges") %>%
mutate(abs_error_s1y2_salary=abs(first_year_percent_of_cap - s1y2_cap_percent)) %>%
slice_max(abs_error_s1y2_salary,n=10) %>%
select(player,s1y2_cap_percent,first_year_percent_of_cap)
gt(sal_y1s2_misses) %>% fmt_percent(columns=y1s2_cap_percent:first_year_percent_of_cap)
gt(sal_s1y2_misses) %>% fmt_percent(columns=s1y2_cap_percent:first_year_percent_of_cap)
```
Common players missed on both models are Bradley Beal, Anfernee Simons, Deandre Ayton, Montrezl Harrell & Marvin Bagley III. Simons, Ayton & Bagley were all restricted free agents. Simons re-upped with the Trail Blazers after stepping into the scoring void left by CJ McCollum in his trade to the Pelicans in February 2022. Ayton signed an offer sheet with the Pacers that was subsequently matched by the Suns. Bagley was never able to live up to the billing of being the number 2 overall pick in 2018, especially with the 3 picks immediately after him being Luka Doncic (4x First-Team All-NBA, 4x All-Star), Jaren Jackson Jr. (2x First-Team All-Defense, 1x Defensive Player of the Year, 1x All-Star) and Trae Young (1x Third-Team All-NBA, 2x All-Star). He was traded from the Kings to the Pistons at the February 2022 deadline, and re-signed with the Pistons in the offseason. Harrell signed with Philadelphia to back up eventual 2023 Most Valuable Player Joel Embiid and pursue a championship.
On a more positive note, here are some players on whom the models were very close. We'll restrict our view to contracts whose first year salary was greater than 5% of the salary cap, since it's easier to get close on minimum & near-minimum contract amounts.
```{r last_yr_extreme_sal_close,echo=FALSE}
sal_y1s2_close=combined_predictions %>%
filter(contract_yrs %in% c(1:5) &
yrs_y1s2 %in% c(1:5) &
first_year_percent_of_cap > 0.05) %>%
mutate(abs_error_y1s2_salary=abs(first_year_percent_of_cap - y1s2_cap_percent)) %>%
slice_min(abs_error_y1s2_salary,n=10) %>%
select(player,y1s2_cap_percent,first_year_percent_of_cap)
sal_s1y2_close=combined_predictions %>%
filter(contract_yrs %in% c(1:5) &
yrs_y1s2 %in% c(1:5) &
first_year_percent_of_cap > 0.05) %>%
mutate(abs_error_s1y2_salary=abs(first_year_percent_of_cap - s1y2_cap_percent)) %>%
slice_min(abs_error_s1y2_salary,n=10) %>%
select(player,s1y2_cap_percent,first_year_percent_of_cap)
gt(sal_y1s2_close) %>% fmt_percent(columns=y1s2_cap_percent:first_year_percent_of_cap)
gt(sal_s1y2_close) %>% fmt_percent(columns=s1y2_cap_percent:first_year_percent_of_cap)
```
Kyle Anderson is the first instance in the 3 years of predictions of an exact match with the actual salary percent. Players appearing in both top 10s are P.J. Tucker, Delon Wright, Isaiah Hartenstein, Jae'Sean Tate and Cody Martin. Only one of the top 10 closest salary hits exceeds 10% as an actual first year salary percentage.
```{r remove_retrospective_vars,include=FALSE}
rm(y1_rmse,s1_rmse,y1_yr_heatmap,s1_yr_heatmap,s1y2_yr_acc,y1s2_yr_acc,combined_predictions,
sal_s1y2_close,sal_s1y2_misses,sal_y1s2_close,sal_y1s2_misses,yrs_s1y2_misses,
yrs_y1s2_misses,s1_rmse_revised,y1_rmse_revised)
```
## Data Exploration and Visualizations
Before we train models, let's see how some of the predictors and targets interact. First, let's see how our targets correlate with each other. I've created a box and whisker plot, and overlaid the points themselves in a transparent, jittered layer so that overlapping points can be distinguished.
```{r targets_corr, echo=FALSE, fig.align='center',out.width="100%"}
set.seed(1) #to make jitter reproducible
corr_yrs_sal=round(
cor(past_free_agents$contract_yrs,
past_free_agents$first_year_percent_of_cap),4)
yrs_vs_sal=past_free_agents %>%
ggplot(aes(x=factor(contract_yrs),y=first_year_percent_of_cap)) +
geom_boxplot() + geom_jitter(alpha=0.1,width=0.2) +
labs(x="Contract Years",y="First Year Cap %") +
scale_y_continuous(labels = scales::percent) +
annotate("text",x=1.25,y=0.3,label=paste0("Correlation Coeff:\n",corr_yrs_sal)) +
dark_theme_gray()
ggsave(filename = "Images/Contract Years Against First Year Salary.png",plot=yrs_vs_sal,
width=7.5,height=3.75,units = "in")
knitr::include_graphics("Images/Contract Years Against First Year Salary.png")
```
The correlation coefficient is `r corr_yrs_sal*100`%, which shows that the two targets are strongly and positively correlated. The median value of the first year cap % (middle line in each box) increases with an increase in contract length. The highest increase in first year cap % is between a 4-year and 5-year contract.
Next, let's see if an all-encompassing advanced statistic has a relationship with first year cap percentage. Win Shares represent how much a player has contributed to his team's wins by comparing his output to a marginal player. Higher win shares generally indicate a better player.
```{r ws_vs_salary, echo=FALSE, fig.align='center',out.width="100%"}
free_agent_ws=left_join(past_free_agents,
advanced_and_totals %>% select(seas_id:birth_year,ws)) %>%
#multiple free agents with same name in same season
filter(!(player=="Tony Mitchell" & birth_year==1992 & season==2014)) %>%
filter(!(player=="Chris Johnson" & birth_year==1985 & season==2013))
corr_ws_sal=round(
  cor(free_agent_ws$first_year_percent_of_cap,
      free_agent_ws$ws),4)
ws_vs_sal=free_agent_ws %>%
ggplot(aes(x=first_year_percent_of_cap,y=ws,color=factor(contract_yrs))) +
geom_point() +
scale_colour_brewer(palette="RdYlGn") +
labs(x="First Year Cap %",y="Win Shares",color="Contract Years") +
annotate("text",x=0.05,y=12.5,
label=paste0("Correlation Coeff:\n",corr_ws_sal)) +
dark_theme_gray()
ggsave(filename = "Images/Win Shares Against First Year Salary.png",plot=ws_vs_sal,
width=7.5,height=3.75,units = "in")
knitr::include_graphics("Images/Win Shares Against First Year Salary.png")
```
With a correlation coefficient of `r corr_ws_sal*100`%, win shares are highly correlated with first year cap percentage. This shouldn't be too groundbreaking: better players get paid more. The player with the highest Win Shares to not get a contract is Miles Bridges in 2022 at 7.2, who we covered earlier. The players with the lowest win shares to get a five-year contract are Bradley Beal & Luguentz Dort in 2022 with 1.4 Win Shares (Beal got a maximum contract at 35% of the salary cap, while Dort received 12.4%).
\newpage
Finally, let's see how many contracts of each length were given in each offseason.
```{r contracts_by_season, echo=FALSE, fig.align='center',out.width="100%"}
contract_summary=past_free_agents %>%
group_by(season,contract_yrs) %>% summarize(num_contracts=n()) %>% ungroup() %>%
group_by(season) %>% mutate(tot_contracts=sum(num_contracts),
percent_of_contracts=num_contracts/tot_contracts) %>%
ungroup()
contract_per_seas=contract_summary %>%
ggplot(aes(x=factor(season),fill=factor(contract_yrs))) +
geom_bar(aes(y=percent_of_contracts),
position="dodge",stat="identity") +
labs(x="Season",y="Percent of Contracts Given",fill="Contract Years") +
scale_fill_brewer(palette="RdYlGn") +
scale_y_continuous(labels = scales::percent) +
dark_theme_gray()
ggsave(filename = "Images/Contract Lengths by Offseason.png",plot=contract_per_seas,
width=7.5,height=4,units = "in")
knitr::include_graphics("Images/Contract Lengths by Offseason.png")
```
It comes as no surprise that as contract length increases, the percent of contracts of that length given out decreases. 2012, 2014, 2017, 2020 & 2021 all follow this descending pattern. There are some outliers:
- 4-year contracts almost doubling 3-year contracts in the reckless spending offseason of 2016 (the year of the cap spike, when the salary cap jumped from \$70 million to \$94 million)
- 2-year contracts outnumbering 1-year contracts in 2013, 2019 and 2022
```{r contracts_by_years_signed, echo=FALSE, fig.align='center',out.width="100%"}
avg_yr_contract_summary=contract_summary %>%
group_by(contract_yrs) %>%
summarize(avg_percent_of_contracts=mean(percent_of_contracts))
contracts_by_years_signed=avg_yr_contract_summary %>%
ggplot(aes(x=factor(contract_yrs),y=avg_percent_of_contracts,fill=factor(contract_yrs)))+
geom_bar(stat="identity", show.legend = FALSE)+
labs(x="Contract Years",y="Avg Percent of Contracts Given") +
scale_fill_brewer(palette="RdYlGn") +
geom_text(aes(label=scales::percent(avg_percent_of_contracts,accuracy=0.01)),
position=position_dodge(width=0.9), vjust=-0.25)+
scale_y_continuous(labels = scales::percent) +
dark_theme_gray()
ggsave(filename = "Images/Average Contract Lengths.png",plot=contracts_by_years_signed,
width=7.5,height=4,units = "in")
knitr::include_graphics("Images/Average Contract Lengths.png")
```
Compiling all the seasonal data in the previous graph, we see that in an average offseason about 50% of free agents do not sign a contract. The biggest drop is from 2-year contracts to 3-year contracts.
```{r clean_plots,include=FALSE}
rm(yrs_vs_sal,ws_vs_sal,contract_per_seas,contracts_by_years_signed,
corr_ws_sal,free_agent_ws,contract_summary,avg_yr_contract_summary)
```
\newpage
## Pre-Processing
### Pre-Processing Stats & Vote Shares
I used regular season stats, although I do understand that some players get paid on the strength of playoff performance. I started off with contract year stats, because there's anecdotal evidence that players exert more effort in their contract year (*cough cough Hassan Whiteside*). All stats except for the advanced stats (OWS, DWS and VORP) were converted to per game. Percentages were left alone.
In addition to using contract year stats and vote shares, I summed the past two years and the contract year.
Why I settled on 3 years:
- Players do get paid on past performance, so just using contract year stats was out of the question
- 2 years opens up the possibility of a fluke year
- Kawhi's nine-game 2018 season would drag his averages down significantly relative to his 2019 season with the Raptors: adding another year somewhat lessens this effect
- On the other hand, it's quite unlikely that teams factor in stats from more than 4 years ago; too much would have changed
- the Celtics didn't pay Blake Griffin to recapture the form of his 2019 All-Star year in Detroit (I would hope)
- Another reason I settled on 3 years is that I can keep the same model for restricted free agents
- my thought is that the rookie year is a bonus: great if you did well, but doesn't matter in the grand scheme of things if you did poorly
- rookie extension is more based on how you improved over the course of that initial contract
- For example, if Ja Morant had a worse rookie year but the same level of play that he achieved in his second and third years (as well as next year), I highly doubt that Memphis would have offered him significantly less money due to that substandard rookie year
I performed the same processing on the three-year totals, using the three-year game total as the denominator for converting to per game. I had to calculate the three-year percentages, and also re-engineered the win shares per 48 minutes metric.
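The rolling three-year sums used in the chunk below rely on `zoo::rollapplyr` with `partial=TRUE`. As a minimal sketch with toy numbers, the partial option sums over whatever history exists at the start of a career instead of returning NA:

```r
library(zoo)

# Hypothetical per-season totals for one player's first four seasons
season_totals <- c(2, 4, 6, 8)

# Right-aligned rolling 3-season sum; partial=TRUE uses a shorter
# window for seasons 1-2 instead of producing NAs
rollapplyr(season_totals, width = 3, FUN = sum, partial = TRUE)
# seasons 1-2 sum only the available years (2, 6), then full windows (12, 18)
```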
I removed categories that were linear combinations of one another. For example, total rebounds can be found by simply adding offensive and defensive rebounds, and points are just 2\*(number of 2-point field goals made) + 3\*(number of 3-point field goals made) + (number of free throws made). I kept age and experience as predictor variables, but removed position because I felt it would ultimately reflect in the stats themselves.
I divided the three-year versions of games played percentage, games started percentage, All-NBA voting share and All-Defense voting share by 3 to rescale the columns back to a 0-1 range. For example, James Harden received a 100% vote share in 3 consecutive seasons from 2017-2019. Rather than showing all_nba_share_last_3_yrs as 1+1+1=3 in his 2019 row, dividing by 3 gives us the more easily understood 100% of possible points received in the last 3 years.
The final step was to replace missing values in the shooting percentages with zeroes. These NA's were originally due to lack of attempts.
```{r pre_process}
stats_and_shares=advanced_and_totals %>%
left_join(.,all_nba_voting_shares) %>%
left_join(.,all_d_vote_shares_since_1991) %>%
replace_na(list(all_nba_share=0,all_defense_share=0))
three_year_rolling_stats_and_shares=stats_and_shares %>% group_by(player_id) %>%
#three year sum
mutate(across(-c(1:9,fg_percent,x3p_percent,
x2p_percent:e_fg_percent,ft_percent),
list(three_yrs=~rollapplyr(.,3,sum,partial=TRUE)),
.names="{col}_last_3_yrs")) %>%
mutate(ws_per_48_last_3_yrs=ws_last_3_yrs/mp_last_3_yrs*48) %>%
mutate(fg_percent=ifelse(fga==0,0,fg/fga),
x3p_percent=ifelse(x3pa==0,0,x3p/x3pa),
x2p_percent=ifelse(x2pa==0,0,x2p/x2pa),
e_fg_percent=ifelse(fga==0,0,(fg+0.5*x3p)/fga),
ft_percent=ifelse(fta==0,0,ft/fta)) %>%
mutate(fg_percent_last_3_yrs=
ifelse(fga_last_3_yrs==0,0,fg_last_3_yrs/fga_last_3_yrs),
x3p_percent_last_3_yrs=
ifelse(x3pa_last_3_yrs==0,0,x3p_last_3_yrs/x3pa_last_3_yrs),
x2p_percent_last_3_yrs=
ifelse(x2pa_last_3_yrs==0,0,x2p_last_3_yrs/x2pa_last_3_yrs),
e_fg_percent_last_3_yrs=
ifelse(fga_last_3_yrs==0,0,
(fg_last_3_yrs+0.5*x3p_last_3_yrs)/fga_last_3_yrs),
ft_percent_last_3_yrs=
ifelse(fta_last_3_yrs==0,0,ft_last_3_yrs/fta_last_3_yrs)) %>%
#remove categories that aren't predictive vars or linear combo of others
select(-c(lg,pos,birth_year,tm,
trb,trb_last_3_yrs,
fg,fga,fg_last_3_yrs,fga_last_3_yrs,
pts,pts_last_3_yrs)) %>%
#convert contract year and last 3 year stats to per game (except games)
mutate(across(c(mp,x3p:x3pa,x2p:x2pa,ft:fta,orb:pf),list(per_game=~./g)),
.after="gs_percent") %>%
select(-c(g,mp,x3p:x3pa,x2p:x2pa,ft:fta,orb:pf,ws)) %>%
mutate(across(mp_last_3_yrs:pf_last_3_yrs,list(per_game=~./g_last_3_yrs)),
.after="gs_percent_last_3_yrs") %>%
select(-c(g_last_3_yrs,mp_last_3_yrs:pf_last_3_yrs,ws_last_3_yrs)) %>%
ungroup() %>%
#rescale games percentages & vote shares over 3 years back to 0-1
mutate(across(c(g_percent_last_3_yrs:gs_percent_last_3_yrs,
all_nba_share_last_3_yrs,
all_defense_share_last_3_yrs),~./3)) %>%
replace_na(list(fg_percent=0,x3p_percent=0,x2p_percent=0,
e_fg_percent=0,ft_percent=0))
```
### Pre-Processing Positions
I took a three-year rolling sum of minutes played for consistency with the previous stats pre-processing, and converted the totals back to percents.
With 3-year positional percentages in hand, it was time to assign the actual positions. Some players play one position almost exclusively, while other players are more flexible and alternate between positions. I set the baseline for a player to be considered at a "pure position" at 75%: if the player played at any position more than 75% of the time, they were deemed to be that position.
All other players were bucketed into combo positions based on the maximum of the following sums of two traditional positions:
- combo guard (point guard/shooting guard)
- small wing (shooting guard/small forward)
- big wing (small forward/power forward)
- big man (power forward/center)
Some players had multiple combo positions listed due to a small amount of minutes played and two players had no positions listed due to playing less than a full minute, so those players were added manually.
```{r pre_process_positions}
pbp_last_three_years=pbp_pos_mins %>% group_by(player_id) %>%
#rolling 3 year sum of minutes played total & position
mutate(across(mp_summed:c_mp_summed,
list(three_yrs=~rollapplyr(.,3,sum,partial=TRUE)),
.names="{col}_last_3_yrs")) %>% ungroup() %>%
select(-c(mp_summed:c_mp_summed)) %>%
#convert totals back to percents
mutate(across(pg_mp_summed_last_3_yrs:c_mp_summed_last_3_yrs,
~./mp_summed_last_3_yrs)) %>%
rename_with(.fn=~gsub(x = ., pattern = "_mp", replacement = "_percent"),
.cols=pg_mp_summed_last_3_yrs:c_mp_summed_last_3_yrs) %>%
select(-mp_summed_last_3_yrs)
#assign pure positions to players with 75% of time at one position
pure_position=pbp_last_three_years %>%
pivot_longer(.,cols=c(pg_percent_summed_last_3_yrs:c_percent_summed_last_3_yrs),
names_to = "pos",values_to = "percent_at_pos") %>%
group_by(seas_id) %>% slice_max(percent_at_pos) %>% ungroup() %>%
filter(percent_at_pos>0.75) %>% mutate(pos=word(pos,1,sep="_"))
combo_position=
#remove players with pure positions
anti_join(pbp_last_three_years,pure_position %>% select(1:4)) %>%
#create buckets of in-between positions
mutate(combo_guard=pg_percent_summed_last_3_yrs+sg_percent_summed_last_3_yrs,
small_wing=sg_percent_summed_last_3_yrs+sf_percent_summed_last_3_yrs,
big_wing=sf_percent_summed_last_3_yrs+pf_percent_summed_last_3_yrs,
big_man=pf_percent_summed_last_3_yrs+c_percent_summed_last_3_yrs) %>%
select(-c(pg_percent_summed_last_3_yrs:c_percent_summed_last_3_yrs)) %>%
pivot_longer(.,cols=c(combo_guard:big_man),
names_to = "pos",values_to = "percent_at_pos") %>%
group_by(seas_id) %>% slice_max(percent_at_pos) %>% ungroup() %>%
  #5 players had 2 separate combo positions listed
filter(!(player=="Terrance Roberson" & season==2001 & pos=="big_wing") &
!(player=="Ty Jerome" & season==2020 & pos=="small_wing") &
!(player=="Anthony Gill" & season==2021 & pos=="big_man") &
!(player=="Chris Duarte" & season==2022 & pos=="combo_guard") &
!(player=="Jordan Hall" & season==2023 & pos=="combo_guard"))
all_player_pos=bind_rows(pure_position,combo_position) %>%
#2 players had 0 MP, so they were lost in both combo & pure
add_row(seas_id=19914,season=2006,player_id=3589,
player="Alex Scales",pos="sg",percent_at_pos=1) %>%
add_row(seas_id=22403,season=2010,player_id=3882,
player="JamesOn Curry",pos="pg",percent_at_pos=1) %>%
filter(season>2009) %>% select(-c(seas_id,percent_at_pos)) %>%
mutate(pos_group=case_when(pos %in% c("pg","sg","combo_guard")~"guard",
pos %in% c("sf","small_wing","big_wing")~"wing",
pos %in% c("pf","c","big_man")~"big"))
```
### Combining Pre-Processing Dataframes
I joined the positional data and grouped by season and positional group to get the percentage of VORP a player contributed to their position, as a proxy for positional scarcity. While a player may not have a high raw VORP compared to all offseasons, they could account for a significant proportion of their positional group's VORP in a specific offseason and possibly induce a bidding war between teams due to unappetizing alternatives.
I changed contract years from a numeric column to a factor/category column. This changes its prediction from a regression problem to a classification problem. A 2.5-year contract doesn't make much sense, so it is in our best interest to discretize the years and store them as factors rather than round a regression result.
```{r train_set}
train_set=inner_join(three_year_rolling_stats_and_shares,past_free_agents) %>%
#multiple free agents with same name in same season
filter(!(player=="Tony Mitchell" & seas_id==25056)) %>%
filter(!(player=="Chris Johnson" & seas_id==23991)) %>%
left_join(.,all_player_pos) %>% group_by(season,pos_group) %>%
mutate(position_vorp=sum(vorp_last_3_yrs)) %>% ungroup() %>%
mutate(percent_of_pos_vorp=vorp_last_3_yrs/position_vorp) %>%
mutate(contract_yrs=factor(contract_yrs,levels = 0:5)) %>%
select(-c(pos,pos_group,position_vorp)) %>%
mutate(across(-c(seas_id:experience,type:contract_yrs),~round(.,digits=4)))
write_csv(train_set,"Data/Train Set.csv")
```
## Training Models
There is no need to withhold a subset of the training set as a test set before running the models on the evaluation set, because cross-validation is built in. In the first iteration of this project, I utilized leave-one-out cross validation since the dataset is relatively small (\<1000 observations). How this works: the model is fit with one observation excluded, then attempts to predict that excluded observation, and this is repeated for every observation. However, as the dataset has grown on subsequent runs, I've switched to k-fold cross validation. I've achieved even better results while cutting training time significantly.
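As a rough sketch of the difference in cost (using `rsample`, which tidymodels loads; the data frame here is illustrative), leave-one-out creates one resample per row, while 10-fold creates only ten:

```r
library(rsample)

set.seed(100)
toy <- data.frame(x = rnorm(50), y = rnorm(50))

loo   <- loo_cv(toy)           # 50 resamples: each holds out a single row
folds <- vfold_cv(toy, v = 10) # 10 resamples: each holds out ~5 rows

nrow(loo)   # 50 model fits required per candidate model
nrow(folds) # 10 model fits required per candidate model
```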
As we saw in data visualization, the two target variables (contract years and first year salary as a percentage of the salary cap) are fairly well correlated, as they have a Pearson correlation coefficient of `r corr_yrs_sal`. The way I chose to handle this is:
- predict one target first without the other as a predictor
- choose the best model (be that a single model or an ensemble of multiple models)
- use the first target's predictions as an input to predict the second target
One potential problem is compounding errors. If there's an incorrect year prediction, it might lead to an incorrect salary prediction. Initially, I thought it would be sufficient to alleviate this problem by utilizing tidymodels' ability to run multivariate models with more than one outcome. However, I realized that tuning isn't yet implemented for multi-output regression, and I'm uneasy about choosing arbitrary parameters. An additional (minor) drawback is that the outcomes must be of the same type, so we would have to undo the contract years conversion done in the last section.
### The Models
I used a total of six models.
- linear regression model as a baseline for salary, and multinomial regression as a baseline for years
- the separation is due to the classification/regression split
- k-nearest neighbors model: take the distance between the statistics of two players (the absolute value of the difference) and then take the average of the outcome variable of the k nearest neighbours
- the intuition being that similar players get similar contracts
- decision tree model (`rpart`): maybe as a player passes certain statistical thresholds, their contract increases
- only using for predicting the contract years; since there are so many different salary percentages, a solitary decision tree would either be useless or far too complicated
- random forest model (`ranger`): reduces instability by averaging multiple trees
- costs interpretability as there is no tree diagram that is representative of the decisions made
- support vector machine model: attempt to separate classes with a hyperplane
- support vectors are the points closest to the hyperplane, named as such because the hyperplane would change if those points were removed
- I believe the following image from Wikipedia succinctly explains an SVM
```{r svm_explain, out.width="35%", fig.align='center', fig.cap="H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximal margin. By User:ZackWeinberg, based on PNG version by User:Cyc - This file was derived from: Svm separating hyperplanes.png, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=22877598",echo=FALSE}
knitr::include_graphics("Images/SVM2.png")
```
\newpage
### Predicting Years First, then Salary
```{r actual_values,include=FALSE}
actual_values=train_set %>%
select(seas_id:player,contract_yrs,first_year_percent_of_cap)
```
```{r model_table_func,include=FALSE}
get_yrs_metrics<-function(model,model_name,tuning_var,train_df){
f_meas=tuning_var %>% show_best(n=1) %>% pull(mean)
contract_yr_predict_vec=predict(model,new_data=train_df) %>% pull()
retrain_acc=accuracy_vec(truth=actual_values$contract_yrs,
estimate=contract_yr_predict_vec)
off_by_more_than_one=sum(abs(as.numeric(as.character(actual_values$contract_yrs))-
as.numeric(as.character(contract_yr_predict_vec)))>1)/
length(contract_yr_predict_vec)
num_max_yr_predicts=sum(as.numeric(as.character(contract_yr_predict_vec))==5)
return(tibble(Method=model_name,
"Correct Predict %"=retrain_acc %>%
label_percent(accuracy=0.01)(),
"Off by >1 Yr"=off_by_more_than_one %>%
label_percent(accuracy = 0.01)(),
"Max Year Predicts"=num_max_yr_predicts,
"F1 Score"=f_meas))
}
get_sal_metrics<-function(model,model_name,tuning_var,train_df){
metrics_df=left_join(tuning_var %>% select_best(metric="rmse"),
tuning_var %>% collect_metrics)
resid_mean_sq_error=metrics_df %>% filter(.metric=="rmse") %>% pull(mean)
mean_abs_error=metrics_df %>% filter(.metric=="mae") %>% pull(mean)
sal_predict_vec=predict(model,new_data=train_df) %>% pull()
prediction_diffs=abs(actual_values$first_year_percent_of_cap-sal_predict_vec)
off_by_more_than_five_percent=sum(prediction_diffs>0.05)/length(prediction_diffs)
within_two_percent=sum(prediction_diffs<=0.02)/length(prediction_diffs)
return (tibble(Method=model_name,
"Off By >5%"=off_by_more_than_five_percent %>%
label_percent(accuracy=0.01)(),
"Within 2%"=within_two_percent %>%
label_percent(accuracy=0.01)(),
MAE=mean_abs_error,RMSE=resid_mean_sq_error))
}
```
```{r pretty_variable_importance,include=FALSE}
prettify_vip<-function(model,title){
importance_plot = vip(extract_fit_parsnip(model),scale=TRUE,
mapping = aes(fill = Variable)) +
dark_theme_gray() + theme(legend.position = "none") + ggtitle(title)
ggsave(paste0("Images/Variable Importance Plots/",title,".png"),plot=importance_plot,
width=7.5,height=3.75,units = "in")
knitr::include_graphics(paste0("Images/Variable Importance Plots/",title,".png"))
}
```
We'll start by predicting years first and then salary with years as an input.
With caret, everything could be thrown into the train function: cross-validation, tuning, data, etc. Tidymodels is more explicit in its steps. The standard 10-fold cross validation is how the data is going to be resampled. In addition, we'll stratify selection, so there is sufficient data to predict all contract year lengths. A hypothetical (albeit unlikely) scenario that could occur without stratification is a fold could include all instances of restricted free agents (which are more rare) in the test portion, and none in the train portion. The chosen algorithm would be confused as to how to predict something for which it has no training.
Recipes is tidymodels' pre-processing package. The first step is to change the roles of some variables. The seas_id, season, player_id & player variables are of no use as predictors, but are useful for identifying observations, so they are given the "id" role. Since we are predicting years first, the first_year_percent_of_cap variable will be removed. Finally, we will convert the free agent type column into numeric dummy (indicator) columns.
```{r y1_steps}
set.seed(100)
cv=vfold_cv(train_set,v=10,strata=type,repeats=5)
y1_recipe=recipe(contract_yrs~.,data=train_set) %>%
update_role(seas_id:player,new_role="id") %>%
step_rm(first_year_percent_of_cap) %>% step_dummy(type)
```
Initially, accuracy was chosen as the metric to determine the best submodel by cross-validation. However, with the inherent imbalance of the outcome classes, the F1 score is a better metric. As an extreme example, if there were only two classes with a 90:10 split, a classifier could achieve 90% accuracy by simply predicting the more populous class for every case. On the other hand, the F1 score attempts to minimize both false positives & false negatives. However, the default averaging for F1 is macro-averaging, which gives equal weights to all classes and is exactly the problem we were trying to distance ourselves from by not choosing accuracy. We need to change the averaging function to be macro-weighted. In order to include it within yardstick's metric_set, we have to create a new function wrapping the metric, set the options within the wrapper & formalize it as a new metric.
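To see how the averaging choice plays out numerically, here's a toy base-R illustration with hypothetical counts: a 90:10 class split where the minority class is predicted poorly. Macro averaging weighs both classes equally, while macro-weighted averaging weighs each class's F1 by its share of the data:

```r
# Per-class F1 from hypothetical counts: class A has 90 true cases, B has 10
f1 <- function(tp, fp, fn) {
  p <- tp / (tp + fp) # precision
  r <- tp / (tp + fn) # recall
  2 * p * r / (p + r)
}
f1_a <- f1(tp = 88, fp = 8, fn = 2) # ~0.946
f1_b <- f1(tp = 2,  fp = 2, fn = 8) # ~0.286

mean(c(f1_a, f1_b))      # macro: ~0.62, both classes weigh equally
0.9 * f1_a + 0.1 * f1_b  # macro-weighted: ~0.88, weighted by class prevalence
```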
```{r macro_weight_f_meas}
macro_weight_f1 <- function(data,truth,estimate,beta = 1,
estimator="macro_weighted",na_rm = TRUE,
event_level = yardstick_event_level(),...){
f_meas(data=data,
truth = !! rlang::enquo(truth),
estimate = !! rlang::enquo(estimate),
estimator = "macro_weighted",
beta=beta,
na_rm=na_rm,
event_level=event_level,
...
)
}
macro_weight_f1 <- new_class_metric(macro_weight_f1,"maximize")
```
First up is the multinomial regression model.
```{r multinom_yrs_indep}
wf=workflow() %>% add_recipe(y1_recipe) %>%
add_model(multinom_reg(penalty=0) %>%
set_engine("glmnet") %>%
set_mode("classification"))
tune_multinom=wf %>% tune_grid(resamples=cv,
metrics=metric_set(macro_weight_f1))
wf=wf %>% finalize_workflow(tune_multinom %>% select_best())
fit_y1_multinom=fit(wf,train_set)
```
Let's get the most important variables of the multinomial model.
```{r multinom_y1_top_vars,echo=FALSE, fig.align='center',out.width="100%"}
prettify_vip(fit_y1_multinom,"Multinomial Y1 Model Variable Importance")
```
The contract-year all-defense share dominates the variable importance, with the next best variable being less than 3% as important.
\newpage
Next up is the k-nearest neighbors model.
```{r knn_yrs_indep}
wf=workflow() %>% add_recipe(y1_recipe) %>%
add_model(nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>% set_mode("classification"))
tune_knn=wf %>% tune_grid(resamples=cv,
grid=expand.grid(neighbors=5:50),
metrics=metric_set(macro_weight_f1))
wf=wf %>% finalize_workflow(tune_knn %>% select_best())
fit_y1_knn=fit(wf,train_set)
tune_knn %>% select_best() %>% gt()
```
The best KNN model is near the upper limit of the tuning grid, taking into account 45 of the closest comparable players.
The next model is the decision tree model. We can tune the cost complexity parameter (cp) in this model. CP is the minimum improvement in fit required for another partition to be added. A CP that is too high yields a tree with too few branches, while a CP that is too low yields a tree with so many branches that it's difficult to follow. We lean towards the lower end of the spectrum. We will also tune min_n, the minimum number of data points in a node required for the node to be split further. The default is 20, but the 5-year portion of the dataset is small, so we will allow min_n to go as low as 1. Since there exists an element of randomness in choosing samples to model a decision tree, we need to set a seed to keep the work reproducible.
```{r decision_tree_yrs_indep}
set.seed(100,sample.kind = "Rounding")
wf=workflow() %>% add_recipe(y1_recipe) %>%
add_model(decision_tree(cost_complexity = tune(),min_n=tune()) %>%
set_engine("rpart") %>% set_mode("classification"))
tune_tree=wf %>% tune_grid(resamples=cv,
grid=expand.grid(cost_complexity=0.1^seq(2,5),
                                         min_n=1:10),
metrics=metric_set(macro_weight_f1))
wf=wf %>% finalize_workflow(tune_tree %>% select_best())
fit_y1_tree=fit(wf,train_set)
tune_tree %>% select_best() %>% gt()
```
Let's get the most important variables of the decision_tree model.
```{r tree_y1_top_vars,echo=FALSE, fig.align='center',out.width="100%"}
prettify_vip(fit_y1_tree,"Decision Tree Y1 Model Variable Importance")
```
The individual contract-year components of the holistic advanced stat of win shares rank within the top-4 most important variables. Durability and availability are also highly desirable characteristics, as evidenced by the high values of contract-year minutes per game, contract-year games played percentage and last-3-years games played percentage.
```{r decision_tree, echo=FALSE, fig.align='center',out.width="60%"}
rpart.plot(extract_fit_parsnip(fit_y1_tree)$fit,
           roundint = FALSE,legend.x=NA,extra=0)
```
The decision tree does not predict any 5-year contracts. The decision tree maximizes its prediction at 4 contract years when a player does all of the following:
- has defensive win shares above 0.55 in the contract year
- plays more than 27 minutes per game in the contract year
- has a value over replacement player above 0.75
The next model is a random forest. Since `ranger` is a *random* forest algorithm, we need to set a seed to keep the work reproducible.
Random forest algorithms require an explicit call for variable importance, so we'll ask for permutation importance. A simplified explanation for permutation importance is shuffling a predictor's values and seeing how much the error increases. As a predictor's importance increases, it is difficult for the rest of the model to compute accurate predictions without it. There are 3 tuning parameters:
- trees is the number of decision trees to create
- the default is 500, which we'll keep
- mtry is the number of variables to split at each decision node
- the default is the rounded square root of the number of variables, which in this case would be `round(sqrt(55))`
- we will try all integers between 5 & 10
- min_n is the same as the decision tree
- the default is 1 for a classification problem like this, which we'll keep
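The permutation idea can be sketched in a few lines of base R (toy data and a plain linear model, not the `ranger` implementation): shuffle one predictor at a time, re-score, and record how much the error grows.

```r
set.seed(100)
# Toy data: the outcome depends strongly on x1 and not at all on x2
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- 3 * toy$x1 + rnorm(200, sd = 0.5)

fit <- lm(y ~ x1 + x2, data = toy)
rmse <- function(d) sqrt(mean((d$y - predict(fit, newdata = d))^2))
base_rmse <- rmse(toy)

# Shuffle each predictor and measure the resulting increase in error
perm_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]]) # break the link between v and y
  rmse(shuffled) - base_rmse
})
perm_importance # x1's error increase dwarfs x2's
```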
```{r rand_forest_yrs_indep}
set.seed(100,sample.kind = "Rounding")
wf=workflow() %>% add_recipe(y1_recipe) %>%
add_model(rand_forest(mtry = tune()) %>%
set_engine("ranger",importance="permutation") %>% set_mode("classification"))
tune_forest=wf %>% tune_grid(resamples=cv,
grid=expand.grid(mtry=5:10),
metrics=metric_set(macro_weight_f1))
wf=wf %>% finalize_workflow(tune_forest %>% select_best())
fit_y1_forest=fit(wf,train_set)
tune_forest %>% select_best() %>% gt()
```
```{r forest_y1__top_vars,echo=FALSE, fig.align='center',out.width="100%"}
prettify_vip(fit_y1_forest,"Random Forest Y1 Model Variable Importance")
```
\newpage
Random forest variable importance has a more gradual decline than either decision tree or multinomial. The contract year versions of the holistic advanced stats are in the top 5 most important variables in the random forest model, with the other 2 spots taken by both versions of the games played percentage.
Finally, we train the support vector machine. There are two tuning parameters:
- rbf_sigma: controls how closely the model fits the training data (higher = closer fit)
- with a high sigma, every workaday big with middling stats might be predicted to get close to 2016 Mozgov money
- cost: the tradeoff between smooth decision boundaries and correctly classifying training points
- too high a cost leads to an overly wiggly boundary that may not generalize to the test set
- we test C = 0.25, C = 0.5, and C = 1
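For reference, kernlab's radial basis function kernel is parameterized as

$$K(x,x')=\exp\left(-\sigma\,\lVert x-x'\rVert^2\right)$$

so a larger `rbf_sigma` ($\sigma$) shrinks each training point's radius of influence, producing the closer (and potentially overfit) fit described above, while `cost` ($C$) scales the penalty on misclassified training points in the SVM objective.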
```{r svm_yrs_indep}
set.seed(100,sample.kind = "Rounding")
wf=workflow() %>% add_recipe(y1_recipe) %>%
add_model(svm_rbf(rbf_sigma=tune(),cost=tune()) %>%
set_engine("kernlab") %>% set_mode("classification"))
tune_svm=wf %>% tune_grid(resamples=cv,
grid=expand.grid(rbf_sigma=0.1^(2:5), # note: seq(2:5) would give 1:4
cost=c(0.25,0.5,1)),
metrics=metric_set(macro_weight_f1))
wf=wf %>% finalize_workflow(tune_svm %>% select_best())
fit_y1_svm=fit(wf,train_set)
tune_svm %>% select_best() %>% gt()
```
Unfortunately, an SVM model has no built-in measure of variable importance.
Let's look at some performance metrics, and see which models we want to take along as inputs for predicting salary.
```{r y1_metrics,echo=FALSE}
models<-get_yrs_metrics(fit_y1_multinom,"Multinomial",tune_multinom,train_set)
models<-bind_rows(models,
get_yrs_metrics(fit_y1_knn,"KNN",tune_knn,train_set),
get_yrs_metrics(fit_y1_tree,"Decision Tree",tune_tree,train_set),
get_yrs_metrics(fit_y1_forest,"Random Forest",tune_forest,train_set),
get_yrs_metrics(fit_y1_svm,"SVM",tune_svm,train_set))
gt(models)
```
The SVM has the highest F1 score, but the lowest accuracy. The random forest stands far above the rest with \~97% accuracy! In 2020's project, the random forests also had the best performance, but had difficulty distinguishing 5-year contracts; shifting to a classification problem has solved that. The multinomial model made significantly more max-year predictions than KNN and had the highest F1-score barring the SVM, whereas KNN had \~4% better accuracy. While anticlimactic, I think the best course of action is simply to use the random forest model to make contract-year predictions. We'll convert the prediction column to numeric so it can serve as a predictor.
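The double conversion in the next chunk guards against a classic R gotcha: calling `as.numeric()` directly on a factor returns the underlying level indices rather than the labels. A minimal illustration:
```{r factor_conversion_sketch,eval=FALSE}
f=factor(c(2,4,3))
as.numeric(f)               # 1 3 2 -- the level indices
as.numeric(as.character(f)) # 2 4 3 -- the actual values
```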
```{r y1_predict_for_s2}
train_set_after_y1=train_set %>% select(-contract_yrs) %>%
bind_cols(contract_yrs=
as.numeric(
as.character(
predict(fit_y1_forest,new_data=train_set) %>% pull()
)
)
)
cv=vfold_cv(train_set_after_y1,v=10,strata=type,repeats=5)
s2_recipe=recipe(first_year_percent_of_cap~.,data=train_set_after_y1) %>%
update_role(seas_id:player,new_role="id") %>% step_dummy(type)
```
Now I can run through the models for salary using the predicted years as an input. The models I won't reuse are the multinomial model and the decision tree. The multinomial model is for classification only, while we are now predicting a continuous outcome: the first-year salary as a percentage of the salary cap. We'll sub in a linear regression model as our new baseline. And since there are so many distinct salary percentages, a decision tree would be either useless or far too complicated.
```{r lin_sal_dep}
wf=workflow() %>% add_recipe(s2_recipe) %>%
add_model(linear_reg() %>%
set_engine("lm") %>%
set_mode("regression"))
tune_lin=wf %>% tune_grid(resamples=cv,metrics=metric_set(rmse,mae))
wf=wf %>% finalize_workflow(tune_lin %>% select_best(metric="rmse"))
fit_s2_lin=fit(wf,train_set_after_y1)
```
```{r lin_s2_top_vars,echo=FALSE, fig.align='center',out.width="100%"}
prettify_vip(fit_s2_lin,"Linear S2 Model Variable Importance")
```
Contract years dwarf all other variables in importance, with both versions of All-NBA share showing up in the top 10 (the last-3-years version 2nd, the contract-year version farther down at number 8).
```{r knn_sal_dep}
wf=workflow() %>% add_recipe(s2_recipe) %>%
add_model(nearest_neighbor(neighbors=tune()) %>%
set_engine("kknn") %>%
set_mode("regression"))
tune_knn=wf %>% tune_grid(resamples=cv,
grid=expand.grid(neighbors=5:50),
metrics=metric_set(rmse,mae))
wf=wf %>% finalize_workflow(tune_knn %>% select_best(metric="rmse"))
fit_s2_knn=fit(wf,train_set_after_y1)
tune_knn %>% select_best(metric="rmse") %>% gt()
```
The best KNN salary model sits around the middle of the tuning grid, using 24 comparables.
```{r forest_sal_dep}
set.seed(100,sample.kind = "Rounding")
wf=workflow() %>% add_recipe(s2_recipe) %>%
add_model(rand_forest(mtry = tune()) %>%
set_engine("ranger",importance="permutation") %>% set_mode("regression"))
tune_forest=wf %>% tune_grid(resamples=cv,
grid=expand.grid(mtry=5:10),