STAT334 Final Project

code
analysis
Author

Ben Moolman

Published

May 11, 2024

Abstract

In this project, we aim to explore NCAA basketball teams and their results from the 2023-2024 season. We use data that has various statistics from the 2023-2024 NCAA basketball regular season. We also use data that has predictions made by the public as to how each team will do in the 2024 NCAA tournament. Our goal is to combine these data sets into a shiny app and offer tables and visuals to explore how teams have performed this last season, and how the public thought they would perform in the 2023-2024 NCAA tournament. The code for the shiny app is in the app.R file in the STAT334_final_project repository here.

Introduction

Data

We are going to look at two separate datasets in this project.

The first is the “College Basketball Dataset” from Kaggle by Andrew Sundberg, which contains data from various seasons of Division 1 college basketball. At the time of this writing, the csv contains data from the 2023-2024 regular season without any data from the NCAA tournament this year. We are going to be using the cbb24.csv for our analysis, as it contains data on all Division 1 NCAA college basketball teams for the 2023-2024 regular season. The link to the Kaggle dataset can be found here. Variables from this dataset are:

  • RK: The ranking of the team at the end of the regular season according to barttorvik
  • TEAM: The name of the team
  • CONF: The conference the team is in
  • G: The number of games the team played in the regular season
  • W: The number of wins the team had in the regular season
  • ADJOE: Adjusted offensive efficeincy (points scored per 100 possessions vs avg D1 defense)
  • ADJDE: Adjusted defensive efficiency (points allowed per 100 possessions vs avg D1 offense)
  • BARTHAG: Power rating from barttorvik (chance of beating avg D1 team)
  • EFG_O: Effective field goal percentage shot
  • EFG_D: Effective field goal percentage allowed
  • TOR: Turnover percentage allowed (turnovers per 100 plays)
  • TORD: Turnover percentage forced (turnovers per 100 plays)
  • ORB: Offensive rebound percentage
  • DRB: Defensive rebound percentage
  • FTR: Free throw rate
  • FTRD: Free throw rate allowed
  • 2P_O: Two point percentage shot
  • 2P_D: Two point percentage allowed
  • 3P_O: Three point percentage shot
  • 3P_D: Three point percentage allowed
  • ADJ_T: Adjusted tempo (possessions per 40 minutes)
  • SEED: The seed the team was given in the 2024 NCAA tournament
# load in data
library(readr)
library(here)
cbb24 <- read_csv(here("data/cbb24.csv"))
library(tidyverse)
# let's add a win percentage variable to cbb24
cbb24 <- cbb24 |> mutate(win_perc = W / G * 100)

Let’s look at what the four 1 seeds in the tournament’s rows look like in the cbb24 dataset.

RK TEAM CONF G W ADJOE ADJDE BARTHAG EFG% EFGD% TOR TORD ORB
1 Houston B12 34 30 119.2 85.5 0.9785 49.7 44.0 13.7 24.7 36.9
2 Connecticut BE 34 31 127.1 93.6 0.9712 57.1 45.1 14.9 16.2 36.5
3 Purdue B10 33 29 126.2 94.7 0.9644 56.0 47.7 16.5 14.0 37.4
9 North Carolina ACC 34 27 116.8 93.2 0.9305 51.3 46.4 14.4 14.9 32.8
DRB FTR FTRD 2P_O 2P_D 3P_O 3P_D ADJ_T WAB SEED win_perc
30.2 29.9 39.0 48.4 43.4 34.7 30.0 63.3 10.6 1 88.23529
26.8 33.3 32.5 58.5 43.7 36.7 31.9 64.6 11.3 1 91.17647
24.7 42.8 23.0 53.2 48.1 40.8 31.4 67.6 11.0 1 87.87879
23.5 36.8 28.3 50.3 46.0 35.4 31.4 70.4 6.6 1 79.41176

Our second dataset is taken from Github’s tidytuesday, and is from Nishaan Amin’s Kaggle dataset and analysis linked here. The tidytuesday task specified two of Nishaan Amin’s many datasets, and the link to the Github site can be found here. These two dataframes contain data on past team results and the predictions the public has for this year’s tournament (year 2024). The datasets are titled team-results and public-picks. For this project, we will only be looking at the public-picks data. Variables from this dataset are:

  • YEAR: The year of the NCAA tournament
  • TEAMNO: The team number
  • TEAM: The name of the team
  • R64: The percentage of the public that picked the team win in the Round of 64
  • R32: The percentage of the public that picked the team win in the Round of 32
  • S16: The percentage of the public that picked the team win in the Sweet 16
  • E8: The percentage of the public that picked the team win in the Elite 8
  • F4: The percentage of the public that picked the team win in the Final Four
  • FINALS: The percentage of the public that picked the team win in the Finals
# load in data
library(tidytuesdayR)
tuesdata <- tidytuesdayR::tt_load('2024-03-26')

    Downloading file 1 of 2: `team-results.csv`
    Downloading file 2 of 2: `public-picks.csv`
public_picks <- tuesdata$'public-picks'

Let’s look at the rows of the public_picks dataset that contain the four 1 seeds in the tournament.

YEAR TEAMNO TEAM R64 R32 S16 E8 F4 FINALS
2024 1067 Connecticut 98.41% 93.59% 80.22% 66.20% 50.27% 34.92%
2024 1056 Houston 97.01% 85.69% 63.77% 42.54% 16.15% 9.27%
2024 1038 North Carolina 97.55% 88.07% 71.16% 45.40% 29.98% 12.10%
2024 1033 Purdue 97.08% 86.21% 66.25% 44.36% 21.64% 10.22%

Goals

Our goal is to create a shiny app that allow users to explore the data from the 2023-2024 NCAA basketball season. We offer tables and visuals that allow users to see how teams performed in the regular season, and how the public thought they would perform in the 2024 NCAA tournament.

We are trying to answer the question of what can guide a team to having a successful regular season, and how that success translates to the NCAA tournament. We look at various statistics from the regular season data to see if there are any trends that can be seen in the data. We also look at the public picks data to see if there are any trends in the data that can be seen. We can see if there are any specific statistics or groupings of statistics that the public tends to pick more often than others.

Visualizations

First, we do some cleaning of the data to make it easier to work with. We are going to be looking at the tables and graphs generated in the shiny app. The code below generating all visualizations is static and will not be run in the shiny app. It is only here to show the visualizations that will be in the shiny app.

library(tidyverse)
library(shiny)

## let's add a win percentage variable to cbb24
cbb24 <- cbb24 |> mutate(win_perc = W / G * 100)

## Let's convert the percentages to numeric in public picks
public_picks[, c("R64", "R32", "S16", "E8", "F4", "FINALS")] <- 
  lapply(public_picks[, c("R64", "R32", "S16", "E8", "F4", "FINALS")], 
         function(x) as.numeric(sub("%", "", x)))

Regular Season Statistics Plot

## Visualization 1: cbb24 filtered for specific teams and specific stats
cbb24_top_teams <- cbb24 |> 
  filter(TEAM %in% c("Houston", "Connecticut", "Purdue", "North Carolina")) |>
  select(TEAM, win_perc, '2P_O', '2P_D', '3P_O', '3P_D')

cbb24_top_teams_long <- pivot_longer(cbb24_top_teams, 
                                     cols = -TEAM, 
                                     names_to = "Variable", 
                                     values_to = "Value")

ggplot(cbb24_top_teams_long, aes(x = Variable, y = Value, fill = TEAM)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Variable", y = "Value",
       title = "Performance Comparison of Top College Basketball Teams",
       subtitle = "Data from 2023-2024 NCAA Basketball Regular Season") +
  theme_minimal(base_size = 20) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis_d()

This visual shows the performance of college basketball teams in the 2023-2024 regular season. We have selected the 4 teams that have been given the 1 seed in the tournament, Houston, Connecticut, Purdue, and North Carolina. We can see how these 4 teams compare in various statistics. Statistics we have chosen here are 2-point field goal percentage allowed (2P_D), offensive 2-point field goal percentage (2P_O), 3-point field goal percentage allowed (3P_D), offensive 3-point field goal percentage (3P_O), and win percentage (win_perc). The shiny app allows users to change what teams and statistics they want to see in this visual.

Regular Season Statistics Table

cbb24_top_teams |>
  arrange(desc(win_perc)) |>
  kable() |>
  kable_styling(full_width = FALSE,
                font_size = 23)
TEAM win_perc 2P_O 2P_D 3P_O 3P_D
Connecticut 91.17647 58.5 43.7 36.7 31.9
Houston 88.23529 48.4 43.4 34.7 30.0
Purdue 87.87879 53.2 48.1 40.8 31.4
North Carolina 79.41176 50.3 46.0 35.4 31.4

The table shows the performance of college basketball teams in the 2023-2024 regular season. This table accompanies the graph above, and provides numerical values for the statistics shown in the graph.

Public Picks Plot

rounds_order <- c("R64", "R32", "S16", "E8", "F4", "FINALS")

public_picks_top_teams <- public_picks |> 
  filter(TEAM %in% c("Houston", "Connecticut", "Purdue", "North Carolina")) |>
  select(TEAM, S16, E8, F4, FINALS)

public_picks_top_teams_long <- pivot_longer(public_picks_top_teams, 
                                     cols = -TEAM, 
                                     names_to = "Round", 
                                     values_to = "Percentage")

public_picks_top_teams_long$Round <- factor(public_picks_top_teams_long$Round, 
                                            levels = rounds_order)

ggplot(public_picks_top_teams_long, 
       aes(x = Round, y = Percentage, fill = TEAM)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Round", y = "Percentage",
       title = "Public Picks for 2024 NCAA Tournament Teams") +
  theme_minimal(base_size = 20) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis_d()

This visual shows the public picks for the 2024 NCAA tournament. Users of the shiny app can see how the public thinks different teams will perform in the tournament. Again, we have selected the 4 teams that have been given the 1 seed in the tournament, Houston, Connecticut, Purdue, and North Carolina, and can explore the public picks. The shiny app allows users to change what teams they want to see in this visual, by modifying the teams selected and what rounds they want to see.

Public Picks Table

public_picks_top_teams |>
  arrange(desc(FINALS)) |>
  kable() |>
  kable_styling(full_width = FALSE,
                font_size = 23)
TEAM S16 E8 F4 FINALS
Connecticut 80.22 66.20 50.27 34.92
North Carolina 71.16 45.40 29.98 12.10
Purdue 66.25 44.36 21.64 10.22
Houston 63.77 42.54 16.15 9.27

The table shows the public picks for the 2024 NCAA tournament. This table accompanies the graph above, and provides numerical values for the percentages shown in the graph.

Conclusion

In conclusion, we have built a shiny app that allows users to explore the data from the 2023-2024 NCAA basketball season. We have provided tables and visuals that allow users to see how teams performed in the regular season, and how the public thought they would perform in the 2024 NCAA tournament. We can look at various statistics from the regular season data to see if there are any trends that can be seen in the data. We can also looked at the public picks data to see if there are any trends in the data that can be seen. We can see if there are any specific statistics or groupings of statistics that the public tends to pick more often than others.

We hope that this shiny app will be useful for anyone interested in exploring the data from the 2023-2024 NCAA basketball season, and can be easily adapted to explore other seasons of NCAA basketball.

If given more time, we would like to add more features to the shiny app. We would like to add another section of exploring data using the second tidytuesday dataset, team-results, that includes data on previous years of March Madness Tournament Data. We would like to add more visuals and tables that allow users to explore this data. And now that the tournament has completed, it would be interesting to see how the public picks compared to the actual results of the tournament.